Posts tagged ‘R’

Resampling Approach to Power Analysis

A coauthor and I are doing power analyses based on a pilot test and we quickly realized that it’s really hard to calculate a power analysis for anything much more exotic than a t-test of means. As I usually do when there’s no obvious solution, I decided to just brute force it with a Monte Carlo approach.

The algorithm is as follows:

  1. Start with pilot data
  2. Draw a sample with replacement of the target sample size n and do this trials times
  3. Run the estimation with each of these resampled datasets and see if enough of them achieve significance (with conventional power and alpha this means 80% have p<.05)

In theory, this approach should allow a power analysis for any estimator, no matter how strange. I was really excited about this and figured we could get a methods piece out of it until my coauthor pointed out Green and MacLeod beat us to it. Nonetheless, I figured it’s still worth a blogpost in case you want to do something that doesn’t fit within the lme4 package, around which Green and MacLeod’s simr package wraps the above algorithm. For instance, as seen below, I show how you can do a power analysis for a stratified sample.

Import pre-test data

In a real workflow you would just use some minor variation on pilot <- read_csv("pilot.csv"). But as an illustration, we can assume a pilot study with a binary treatment, a binary DV, and n=6. Let us assume that 1/3 of the control group are positive for the outcome but 2/3 of the treatment group. Note that a real pilot should be bigger – I certainly wouldn’t trust a pilot with n=6.

treatment <- c(0,0,0,1,1,1)
outcome <- c(0,0,1,0,1,1)
pilot <- as.data.frame(cbind(treatment,outcome))

I suggest selecting just the variables you need to save memory given that you’ll be making many copies of the data. In this case it’s superfluous as there are no other variables, but I’m including it here to illustrate the workflow.

pilot_small <- pilot %>% select(treatment,outcome)

Display observed results

As a first step, do the analysis on the pilot data itself. This gives you a good baseline and also helps you see what the estimation object looks like. (Unfortunately this varies considerably for different estimation functions).

est_emp <- glm(outcome ~ treatment, data = pilot, family = "binomial") %>% summary()
z_emp <- est_emp[["coefficients"]][,3]
est_emp
## 
## Call:
## glm(formula = outcome ~ treatment, family = "binomial", data = pilot)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -0.9005  -0.9005   1.4823  -1.4823   0.9005   0.9005  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.6931     1.2247  -0.566    0.571
## treatment     1.3863     1.7321   0.800    0.423
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.3178  on 5  degrees of freedom
## Residual deviance: 7.6382  on 4  degrees of freedom
## AIC: 11.638
## 
## Number of Fisher Scoring iterations: 4

Note that we are most interested in the t or z column. In the case of glm, this is the third column of the coefficient object but for other estimators it may be a different column or you may have to create it manually by dividing the estimates column by the standard deviation column.

Now you need need to see how the z column appears when conceptualized as a vector. Look at the values and see the corresponding places in the table. I wrapped it in as.vector() because some estimators give z as a matrix.

as.vector(z_emp)
## [1] -0.5659524  0.8003776

In this case, we can see that z_emp[1] is z for the intercept and z_emp[2] is z for the treatment effect. Obviously the latter is more interesting.

Set range of assumptions for resamples

Now we need to set a range of assumptions for the resampling.

trials is the number of times you want to test each sample size. Higher values for trials are slower but make the results more reliable. I suggest starting with 100 or 1000 for exploratory purposes and then going to 10,000 once you’re pretty sure you have a good value and want to confirm it. The arithmetic is much simpler if you stick with powers of ten.


nrange
 is a vector of values you want to test out. Note that the z value for your pilot gives you a hint. If it’s about 2, you should try values similar to those in the pilot. If it’s much smaller than 2, you should try values much bigger.

trials <- 1000 # how many resamples per sample size
nrange <- c(50,60,70,80,90,100) # values of sample size to test

Set up resampled data, do the regressions, and store the results

This is the main part of the script.

It creates the resampled datasets in a list called dflist. The list is initialized empty and dataframes are stored in the list as they are generated.

Store z scores and sample size in matrix results.

dflist <- list()
k <- 1 # this object keeps track of which row of the results object to write to
results <- matrix(nrow = length(nrange)*trials, ncol=3) %>% data.frame() #adjust ncol value to be length(as.vector(z_emp))+1
colnames(results) <- c('int','treatment','n') # replace the vector with names for as.vector(z_emp) positions followed by "n." The nanes need not match the names in the regression table but should capture the same concepts.

for (i in 1:length(nrange)) {
  dflist[[i]] <- list()
  for (j in 1:trials) {
    dflist[[i]][[j]] <- sample_n(pilot_small,size=nrange[i],replace=T)
    est <- glm(outcome ~ treatment, data = dflist[[i]][[j]], family = "binomial") %>% summary() # adjust the estimation to be similar to whatever you did in the "test estimation" block of code, just using data=dflist[[i]][[j]] instead of data=pilot
    z <- est[["coefficients"]][,3] # you may need to tweak this line if not using glm
    results[k,] <- c(as.vector(z),nrange[i])
    k <- k +1
  }
}

# create vector summarizing each resample as significant (1) or not significant (0)
results$treatment.stars <- 0
results$treatment.stars[abs(results$treatment)>=1.96] <- 1

Interpret results

Dist of Z by sample size

As an optional first step, plot the distributions of z-scores across resamples by sample size.

results %>% ggplot(mapping = aes(x=treatment)) + 
  geom_density(alpha=0.4) + 
  theme_classic() + 
  facet_wrap(~n)

Number of Significant Resamples for Treatment Effect

Next make a table for what you really want to know, which is how often resamples of a given sample size gives you statistical significance. This rate can be interpreted as power.

table(results$n,results$treatment.stars)
##      
##         0   1
##   50  355 645
##   60  252 748
##   70  195 805
##   80  137 863
##   90  105 895
##   100  66 934

As you can see, n=70 seems to give about 80% power. To confirm this and get a more precise value, you’d probably want to run the script again but this time with nrange <- c(67,68,69,70,71,72,73) and trials <- 10000.

Stratified Samples

50/50 Strata (Common for RFTs)

Note that you can modify the approach slightly to have stratified resamples. For instance, you might want to ensure an equal number of treatment and outcome cases in each resample to mirror a 50/50 random assignment design. (This should mostly be an issue for relatively small resamples as for large resamples you are likely to get very close to the ratio in the pilot test just by chance.)

To do this we modify the algorithm by first splitting the pilot data into treatment and control data frames and then sampling separately from each before recombining, but otherwise using the same approach as before.

pilot_control <- pilot_small %>% filter(treatment==0)
pilot_treatment <- pilot_small %>% filter(treatment==1)

trials_5050 <- 1000 # how many resamples per sample size
nrange_5050 <- c(50,60,70,80,90,100) # values of sample size to test

nrange_5050 <- 2 * round(nrange_5050/2) # ensure all values of nrange are even

dflist_5050 <- list()
k <- 1 # this object keeps track of which row of the results object to write to
results_5050 <- matrix(nrow = length(nrange_5050)*trials_5050, ncol=3) %>% data.frame() #adjust ncol value to be length(as.vector(z_emp))+1
colnames(results_5050) <- c('int','treatment','n') # replace the vector with names for as.vector(z_emp) positions followed by "n." The nanes need not match the names in the regression table but should capture the same concepts.


for (i in 1:length(nrange_5050)) {
  dflist_5050[[i]] <- list()
  for (j in 1:trials_5050) {
    dflist_5050[[i]][[j]] <- rbind(sample_n(pilot_control,size=nrange_5050[i]/2,replace=T),
                                   sample_n(pilot_treatment,size=nrange_5050[i]/2,replace=T))
    est <- glm(outcome ~ treatment, 
               data = dflist_5050[[i]][[j]], 
               family = "binomial") %>% 
               summary()
    z <- est[["coefficients"]][,3]
    results_5050[k,] <- c(as.vector(z),nrange_5050[i])
    k <- k +1
  }
}

results_5050$treatment.stars <- 0
results_5050$treatment.stars[abs(results_5050$treatment)>=1.96] <- 1

Number of Significant Resamples for Treatment Effect With 50/50 Strata

table(results_5050$n,results_5050$treatment.stars)
##      
##         0   1
##   50  330 670
##   60  248 752
##   70  201 799
##   80  141 859
##   90   91 909
##   100  63 937

Not surprisingly, our 80% power estimate is still about 70.

One Fixed Strata and the Other Estimated

Or perhaps you know the size of a sample in one strata and want to test the necessary size of another strata. Perhaps a power analysis for a hypothesis specific to strata one gives n1 as its necessary sample size but you want to estimate a power analysis for a pooled sample where n=n1+n2. Likewise, you may wish to estimate the necessary size of an oversample. Note that if you already have one strata in hand you could modify this code to work but should just use the data for that strata, not resamples of it.

For one fixed and one estimated strata, let’s assume our pilot test is departments A and B from the UCBAdmissions file, that we know we need n=500 for 80% power on some hypothesis specific to department A, and we are trying to determine how many we need for department B in order to pool and analyze them together. I specify dummies for gender (male) and a dummy for department A (vs B).

ucb_tidy <- UCBAdmissions %>% 
  as_tibble() %>% 
  uncount() %>% 
  mutate(male = (Gender=="Male"), 
         admitted = (Admit=="Admitted")) %>% 
  select(male,admitted,Dept)

ucb_A <- ucb_tidy %>% filter(Dept=="A") %>% mutate(depA=1)
ucb_B <- ucb_tidy %>% filter(Dept=="B") %>% mutate(depA=0)

n_A <- 500
trials_B <- 1000 # how many resamples per sample size
nrange_B <- c(50,100,150,200,250,300,350,400,450,500,550,600,650,700,750,800,850,900,950,1000) # values of sample size to test

dflist_B <- list()
k <- 1 # this object keeps track of which row of the results object to write to
results_B <- matrix(nrow = length(nrange_B)*trials_B, ncol=4) %>% data.frame() #adjust ncol value to be length(as.vector(z_emp))+1
colnames(results_B) <- c('int','male','depA','n') # replace the vector with names for as.vector(z_emp) positions followed by "n." The nanes need not match the names in the regression table but should capture the same concepts.

for (i in 1:length(nrange_B)) {
  dflist_B[[i]] <- list()
  for (j in 1:trials_B) {
    dflist_B[[i]][[j]] <- rbind(sample_n(ucb_B,size=nrange_B[i],replace=T),
                                sample_n(ucb_A,size=n_A,replace=T))
    est <- glm(admitted ~ male + depA, data = dflist_B[[i]][[j]], family = "binomial") %>% summary()
    z <- est[["coefficients"]][,3]
    results_B[k,] <- c(as.vector(z),nrange_B[i])
    k <- k +1
  }
}

# add dummy for dept

results_B$male.stars <- 0
results_B$male.stars[abs(results_B$male)>=1.96] <- 1

Number of Significant Resamples for Gender Effect With Unbalanced Strata

table(results_B$n,results_B$male.stars)
##       
##          0   1
##   50   108 892
##   100  132 868
##   150  128 872
##   200  120 880
##   250  144 856
##   300  133 867
##   350  153 847
##   400  194 806
##   450  181 819
##   500  159 841
##   550  186 814
##   600  168 832
##   650  185 815
##   700  197 803
##   750  201 799
##   800  194 806
##   850  191 809
##   900  209 791
##   950  183 817
##   1000 214 786

This reveals a tricky pattern. We see about 90% power when there are either 50 or 100 cases from department B (i.e., 550-600 total including the 500 from department A). With trials_B <- 1000 it’s a bit noisy but still apparent that the power drops as we add cases from B and then rises again along a U-shaped curve.

Normally you’d expect that more sample size would mean more statistical power because standard error is inversely proportional to the square root of degrees of freedom. The trick is that this assumes nothing happens to β. As it happens, UCBAdmissions is a famous example of Simpson’s Paradox and specifically the gender effects are much stronger for department A …

glm(admitted ~ male, data = ucb_A, family = "binomial") %>% summary()
## 
## Call:
## glm(formula = admitted ~ male, family = "binomial", data = ucb_A)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8642  -1.3922   0.9768   0.9768   0.9768  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.5442     0.2527   6.110 9.94e-10 ***
## maleTRUE     -1.0521     0.2627  -4.005 6.21e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1214.7  on 932  degrees of freedom
## Residual deviance: 1195.7  on 931  degrees of freedom
## AIC: 1199.7
## 
## Number of Fisher Scoring iterations: 4

… than the gender effects are for department B.

glm(admitted ~ male, data = ucb_B, family = "binomial") %>% summary()
## 
## Call:
## glm(formula = admitted ~ male, family = "binomial", data = ucb_B)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5096  -1.4108   0.9607   0.9607   0.9607  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   0.7538     0.4287   1.758   0.0787 .
## maleTRUE     -0.2200     0.4376  -0.503   0.6151  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 769.42  on 584  degrees of freedom
## Residual deviance: 769.16  on 583  degrees of freedom
## AIC: 773.16
## 
## Number of Fisher Scoring iterations: 4

Specifically, department A strongly prefers to admit women whereas department B has only a weak preference for admitting women. The pooled model has a dummy to account for department A generally being much less selective than department B but it tacitly assumes that the gender effect is the same as it has no interaction effect. This means that as we increase the size of the department B resample, we’re effectively flattening the gender slope through compositional shifts towards department B and its weaker preference for women.

December 3, 2021 at 2:14 pm

Drop list elements if

I am doing a simulation in R that generates a list of simulated data objects and then simulates emergent behavior on them. Unfortunately, it chokes on “data” meeting a certain set of conditions and so I need to kick data like that from the list before running it. This is kind of hard to do as if you delete a list element you mess up the indexing since when you kick a list element the list is now shorter. My solution is to create a vector of dummies flagging which list elements to kick by indexing in ascending order, but then actually kick list elements in descending order. Note that if the length of the list is used elsewhere in your code that you’ll need to update it after running it.

FWIW, the substantive issue in my case is diffusion where some of the networks contain isolates, which doesn’t work in the region of parameter space which assumes all diffusion occurs through the network. In my case, deleting these networks is a conservative assumption.

Capture

Apologies for posting the code as a screenshot but WordPress seems to really hate recursive loops, regardless of whether I use sourcecode or pre tags.

 

 

April 5, 2019 at 2:43 pm

What is the word for “log” in R?

| Gabriel |

Like most native speakers of Stata, the most natural thing in the world is to start every script or session with a log file. Typing “log using” is like brushing your teeth, it’s the first thing you do and you just feel gross if you haven’t done it. Judging by what you get if you Google “logging in R” this seems to be something of a cultural eccentricity peculiar to Stata users as R users seems not to understand the question. In particular most responses to the question say something like “use sink(),” which ignores that to a Stata user a log file is neither the command history nor the output, but the two interpolated together so that you can see what command elicited what output.

However, much as the frustrated tourist abroad will occasionally find someone who understands what they mean in asking for a Western toilet, one great StackOverflow user speaks sufficient Stata to direct us to what we were hoping to find. Specifically, the library “TeachingDemos” includes a “txtStart()” function that behaves almost exactly like a Stata log file by default, but where you also have various options such as to suppress commands/output or to use Markdown format.

To install TeachingDemos:

install.packages('TeachingDemos')

Thereafter invoke it start a log file, do your work, and close it:

library(TeachingDemos)
txtStart('mylogfile.txt') # this is similar to "log using mylogfile.txt" in Stata
#insert code here to load your data, analyze it, etc
txtStop() # this is similar to "log close" in Stata

October 9, 2014 at 3:48 pm 4 comments

Scraping Using twitteR (updated)

| Gabriel |

Last time I described using the twitteR library for R. In that post I had R itself read over a list to loop. In this post I make the have looping occur in Bash with argument passing to R through the commandArgs() function.

First, one major limitation of using Twitter is that it times you out for an hour after 150 queries. (You can double this if you use OAuth but I’ve yet to get that to work). For reasons I don’t really understand, getting one feed can mean multiple queries, especially if you’re trying to go far back in the timeline. For this reason you need to break up your list into a bunch of small lists and cron them at least 80 minutes apart. This bit of Bash code will split up a file called “list.txt” into several files. Also, to avoid trouble later on, it makes sure you have Unix EOL.

split -l 50 list.txt short_tw
perl -pi -e 's/\r\n/\n/g' short_tw*

The next thing to keep in mind is that you’ll need to pass arguments to R. Argument passing is when a script takes input from outside the script and processes it as variables. The enthusiastic use of argument passing in Unix is the reason why there is a fine line between a file and a command in that operating system.

In theory you could have R read the target list itself but this crashes when you hit your first dead URL. Running the loop from outside R makes it more robust but this requires passing arguments to R. I’d previously solved this problem by having Stata write an entire R script, which Stata understood as having variables (or “macros”) but which from R’s perspective was hard-coded. However I was recently delighted to discover that R can accept command-line arguments with the commandArgs() function. Not surprisingly, this is more difficult than $1 in Bash, @ARGV in Perl, or `1′ in Stata, but it’s not that bad. To use it you have to use the “–args” option when invoking R and then inside of R you use the commandArgs() function to pass arguments to an array object, which behaves just like the @ARGV array in Perl.

Here’s an R script that accepts a Twitter screenname as a command-line argument, uses the twitteR library to collect that feed, and then saves it as a tab-delimited text file of the same name. (It appends if there’s an existing file). Also note that (thanks to commenters on the previous post) it turns internal EOL into regular spaces. It’s currently set to collect the last 200 tweets but you can adjust this with the third line (or you could rewrite the script to make this a command-line argument as well).

args <- commandArgs(trailingOnly = TRUE)
library(twitteR)
howmany <- 200 #how many past tweets to collect

user <- args[1]
outputfile <- paste('~/project/feeds/',user,'.txt',sep="")
print(user)
print(outputfile)

tmptimeline <- userTimeline(user,n=as.character(howmany))
tmptimeline.df <- twListToDF(tmptimeline)
tmptimeline.df$text <- gsub("\\n|\\r|\\t", " ", tmptimeline.df$text)
write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)

quit()

To use the script to get just a single feed, you invoke it like this from the command-line.

R --vanilla --args asanews < datacollection.R

Of course the whole reason to write the script this way is to loop it over the lists. Here it is for the list “short_twaa”.

for i in `cat short_twaa`; do R --vanilla --args $i < datacollection.R ; done

Keep in mind that you’ll probably want to cron this, either because you want a running scrape or because it makes it easier to space put the “short_tw*” files so you don’t get timed out.

December 20, 2011 at 11:04 pm 1 comment

Scraping Using twitteR

| Gabriel |

Previously I’d discussed scraping Twitter using Bash and Perl. Then yesterday on an orgtheory thread Trey mentioned the R library twitteR and with some help from Trey I worked out a simple script that replaces the twitterscrape_daily.sh and twitterparse.pl scripts from the earlier workflow. The advantage of this script is that it’s a lot shorter, it can get an arbitrary number of tweets instead of just 20, and it captures some of the meta-text that could be useful for constructing social networks.

To use it you need a text file that consists of a list of Twitter feeds, one per line. The location of this file is given in the “inputfile” line.

The “howmany” line controls how many tweets back it goes in each feed.

The “outputfile” line says where the output goes. Note that it treats it as append. As such you can get some redundant data, which you can fix by running this bash code:

sort mytweets.txt | uniq > tmp
mv tmp mytweets.txt

The outputfile has no headers, but they are as follows:

#v1 ignore field, just shows number w/in query
#v2 text of the Tweet
#v3 favorited dummy
#v4 replytosn (mention screenname)
#v5 created (date in YMDhms)
#v6 truncated dummy
#v7 replytosid
#v8 id
#v9 replytouid
#v10 statussource (Twitter client)
#v11 screenname

Unfortunately, the script doesn’t handle multi-line tweets very well, but I’m not sufficiently good at R to regexp out internal EOL characters. I’ll be happy to work this in if anyone cares to post some code to the comments on how to do a find and replace that zaps the internal EOL in the field tmptimeline.df$text.

library(twitteR)
howmany <- 30 #how many past tweets to collect
inputfile <- "~/feedme.txt"
outputfile <- "~/mytweets.txt"

feeds <- as.vector(t(read.table(inputfile)))
for (user in feeds) {
	tmptimeline <- userTimeline(user,n=as.character(howmany))
	tmptimeline.df <- twListToDF(tmptimeline)
	write.table(tmptimeline.df,file=outputfile,append=TRUE,sep="\t",col.names=FALSE)
}

Finally, if you import it into Stata you’ll probably want to run this:

drop v1
ren v2 text
ren v3 favorited 
ren v4 replytosn
ren v5 created
ren v6 truncated 
ren v7 replytosid
ren v8 id
ren v9 replytouid
ren v10 statussource
ren v11 screenname
gen double timestamp=clock(subinstr(created,"-","/",.),"YMDhms")
format timestamp %tc

December 13, 2011 at 1:21 pm 5 comments

Chain of Litigation

| Gabriel |

I was intrigued by the FT infographic on cell phone patent suits and decided to reformat it with F-R layout to get a big picture. A few things leap out. First, there is some pretty serious reciprocity (aka, counter-suits) going on, especially with Apple. On the other hand, Microsoft seems to be pretty good at attacking people who aren’t in a position to fight back. (*cough*trolls*cough*). Second, Google is at the periphery of the network which is pretty strange since many of these suits are actually about Android. This highlights the litigation strategy of picking off the weak members of the Android herd rather than taking the fight directly to Google itself. Furthermore, it suggests a data issue that there are omitted ties from the network, specifically positive ties between firms in the form of alliances (especially the Open Handset Alliance), and that reading the graph without these positive ties is misleading.

Anyway, here’s the graph. Click on it for a scalable PDF. Click here for the data in “.net” format. Code to produce the graph is below the fold.

(more…)

October 18, 2011 at 2:30 pm

Importnew.ado (requires R)

| Gabriel |

After hearing from two friends in a single day who are still on Stata 10 that they were having trouble opening Stata 12 .dta files, I rewrote my importspss.ado script to translate Stata files into an older format, by default Stata 9.

I’ve tested this with Stata 12 and in theory it should work with older versions, but please post positive or negative results in the comments. Remember that you need to have R installed. Anyway, I would recommend handling the backwards compatibility issue on the sender’s side with the native “saveold” command, but this should work in a pinch if for some reason you can’t impose on the sender to fix it and you need to fix it on the recipient’s end. Be especially careful if the dataset includes formats that Stata’s been updating a lot lately (e.g., the date formats).

The syntax is just:

importnew foo.dta

Here’s the code:

*importnew.ado
*by GHR 10/7/2011
*this script uses R to make new Stata files backwards compatible
* that is, use it when your collaborator forgot to use "saveold"

*use great caution if you are using data formats introduced in recent versions
* eg, %tb

*DEPENDENCY: R and library(foreign)
*if R exists but is not in PATH, change the reference to "R" in line 29 to be the specific location

capture program drop importnew
program define importnew
	set more off
	local future `1'
	local version=9  /* version number for your copy of Stata */ 
	local obsolete=round(runiform()*1000)
	local sourcefile=round(runiform()*1000)
	capture file close rsource
	file open rsource using `sourcefile'.R, write text replace
	file write rsource "library(foreign)" _n
	file write rsource `"directory <- "`c(pwd)'" "' _n
	file write rsource `"future <- "`future'" "' _n
	file write rsource `"obsolete <- paste("`obsolete'",".dta",sep="") "' _n
	file write rsource "setwd(directory)" _n
	file write rsource `"data <- read.dta(future, convert.factors=TRUE, missing.type=FALSE)"' _n
	file write rsource `"write.dta(data, file=obsolete, version=`version')"' _n
	file close rsource
	shell R --vanilla <`sourcefile'.R
	erase `sourcefile'.R
	use `obsolete'.dta, clear
	erase `obsolete'.dta
end
*have a nice day

October 10, 2011 at 11:15 am

Misc Links

| Gabriel |

  • Useful detailed overview of Lion. The user interface stuff doesn’t interest me nearly as much as the tight integration of version control and “resume.” Also, worth checking if your apps are compatible. (Stata and Lyx are supposed to work fine. TextMate is supposed to run OK with some minor bugs. No word on R. Fink doesn’t work yet). It sounds good but I’m once again sitting it out for a few months until the compatibility bugs get worked out. Also, as with Snow Leopard many of the features won’t really do anything until developers implement them in their applications.
  • I absolutely loved the NPR Planet Money story on the making of Rihanna’s “Man Down.” (Not so fond of the song itself, which reminds me of Bing Crosby and David Bowie singing “Little Drummer Boy” in matching cardigans). If you have any interest at all in production of culture read the blog post and listen to the long form podcast (the ATC version linked from the blog post is the short version).
  • Good explanation of e, which comes up surprisingly often in sociology (logit regression, diffusion models, etc.). I like this a lot as in my own pedagogy I really try to emphasize the intuitive meaning of mathematical concepts rather than just the plug and chug formulae on the one hand or the proofs on the other.
  • People are using “bimbots” to scrape Facebook. And to think that I have ethical misgivings about forging a user-agent string so wget looks like Firefox.

July 20, 2011 at 3:46 pm

Misc Links

  • Lisa sends along this set of instructions for doing a wide-long reshape in R. Useful and I’m passing it along for the benefit of R users, but the relative intuition and simplicity of “reshape wide stub, i(i) j(j)” is why I still do my mise en place in Stata whenever I use R. Ideally though, as my grad student Brooks likes to remind me, we really should be doing this kind of data mise en place in a dedicated database and use the Stata and R ODBC commands/functions to read it in.
  • The days change at night, change in an instant.”
  • Anyone interested in replicating this paper should be paying close attention to this pending natural experiment. In particular I hope the administrators of this survey are smart enough to oversample California in the next wave. I’d consider doing the replication myself but I’m too busy installing a new set of deadbolts and adopting a dog from a pit bull rescue center.
  • In Vermont, a state government push to get 100% broadband penetration is using horses to wire remote areas that are off the supply curve beaten path. I see this as a nice illustration both of cluster economies and of the different logics used by markets (market clearing price) and states (fairness, which often cashes out as universal access) in the provision of resources. (h/t Slashdot)
  • Yglesias discusses some poll results showing that voters in most of the states that recently elected Republican governors now would have elected the Democrats. There are no poll results for California, the only state that switched to the Democrats last November. Repeat after me: REGRESSION TO THE MEAN. I don’t doubt that some of this is substantive backlash to overreach on the part of politically ignorant swing voters who didn’t really understand the GOP platform, but really, you’ve still got to keep in mind REGRESSION TO THE MEAN.
  • Speaking of Yglesias, the ThinkProgress redesign only allows commenting from Facebook users, which is both a pain for those of us who don’t wish to bear the awesome responsibility of adjudicating friend requests and a nice illustration of how network externalities can become coercive as you reach the right side of the s-curve.

May 31, 2011 at 10:22 am

Another R bleg about loops and graph devices

| Gabriel |

I’m having another problem with R and was hoping somebody could help me out in the comments. Long story short, I can make graphs properly if I do them one at a time, but when I try to loop it I get device errors. In particular it creates the PDFs but they are either empty or corrupted.

Here is the log, which first fails to do it with a loop, then does it right for one case where I manually assign the variables rather than looping.

> # File-Name:       surfacegraphs.R                 
> # Date:            2011-04-25
> # Author:          Gabriel Rossman                                       
> # Purpose:         graph from Stata
> # Packages Used:   lattice   
> # note, wireframe code from lisa 
> timestamp()
##------ Tue Apr 26 10:44:09 2011 ------##
> library(lattice)
> 
> histopath <- '~/Documents/project/histograms'
> image2 <- '~/Documents/project/images/histograms'
> 
> timestamp()
##------ Tue Apr 26 10:44:10 2011 ------##
> 
> #create surface histograms, showing how population evolves over time
> #  parameters held constant
> setwd(histopath)
> for(d in 0:10) {
+ 	for(p in 0:5) {
+ 		d10 <- d*10
+ 		p100 <- p*100
+ 		datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
+ 		dataobject <- read.table(file=datafile,header=TRUE)
+ 		pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
+ 		pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
+ 		pdf(pdfcolor)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) 
+ 		dev.off()
+ 		
+ 		pdf(pdfgrey)
+ 		wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE, par.settings=standard.theme(color=FALSE))
+ 		dev.off()
+ 	}
+ }
There were 50 or more warnings (use warnings() to see the first 50)
> timestamp()
##------ Tue Apr 26 10:44:12 2011 ------##
> 
> #loop doesn't work
> #  seems to be the dev.off()
> #try a few manually
> d10 <- 0
> p100 <- 0
> datafile <- paste(histopath,'/d',d10,'p',p100,'.txt', sep="")
> dataobject <- read.table(file=datafile,header=TRUE)
> pdfcolor <- paste(image2,'/hist_color_d',d10,'p',p100,'.pdf', sep="")
> pdfgrey <- paste(image2,'/hist_grey_d',d10,'p',p100,'.pdf', sep="")
> pdf(pdfcolor)
> wireframe( dataobject$z~dataobject$x*dataobject$y, shade=TRUE) 
> dev.off()
null device 
          1 
> 
> 
> timestamp()
##------ Tue Apr 26 10:44:14 2011 ------##
> 
> 

The warnings start like this and go on from there:

> warnings()
Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
5: In min(x) : no non-missing arguments to min; returning Inf

Any ideas?
Do I just need to hard-code the graphs I really want rather than batching them?

[Update]
As Michal suggested, I needed to wrap wireframe in print. Here’s an example of the output (for a baseline simulation).
hist_color_d0p0

April 27, 2011 at 4:51 am 7 comments

Older Posts


The Culture Geeks