Keep the best 5

March 12, 2010 at 5:05 am 11 comments

| Gabriel |

In my undergrad lecture class I give pop quizzes, which are intended to measure both attendance and reading comprehension. Since I think an absence or two is understandable, I told the students I’d only count their best five quizzes out of seven offered. Of course this meant I had to figure out a way to calculate this, which is a lot harder to do than a simple sum. After some failed experimentation with Stata, I found this was easy to do in perl. This is because perl likes to process data row by row and its “sort()” function sorts elements in an array (rather than rows in a spreadsheet, like the “sort” commands in Stata or Excel). To view it from a Stata-centric or Excel-centric perspective, perl finds it quite natural to sort columns/variables within a row/record.

[Update: also see the comments for some solutions using Excel or R.]

Here’s the perl script

# Created by Gabriel Rossman, 2010-03-10
# this file cleans the grades by sorting the quizzes. this makes it easy to find the best 5

use strict; use warnings;
die "usage: <in.txt> <out.txt>\n" unless @ARGV == 2;

#read-write boilerplate
my $infile = shift (@ARGV);
my $outfile = shift (@ARGV);
open(IN, "<$infile") or die "error reading $infile";
open(OUT, ">$outfile") or die "error creating $outfile";
#loop to read, process, and write line by line
while (<IN>) {
	chomp; #drop \n
	my @fields = split("\t", $_); #parse STDIN into fields
	my $uid = shift (@fields); #lop off the first field, call it "uid" or University ID, which is my key variable
	my @sortedlist = sort {$b <=> $a} @fields; #descending sort other fields
	print OUT "$uid @sortedlist\n"; #write to disk as space-delimited text
close IN;
close OUT;

To execute it from the command line you’d type
perl dirtygrades.txt cleangrades.txt

You could easily use this script with, say, Excel, but I wrap it in a Stata script.

cd "~/Documents/w10_m176/exams/myuclagradebook/"
insheet using grades.csv, comma clear

drop if uid==. & midterm==.

sort uid
save grades.dta, replace
keep uid q1-q7
outsheet using grades.txt, nonames noquote replace

*send to perl to sort quizzes by score (by row)
shell perl grades.txt gradesclean.txt

insheet using gradesclean.txt, clear delim(" ")
ren v1 uid
ren v2 qg1
ren v3 qg2
ren v4 qg3
ren v5 qg4
ren v6 qg5
ren v7 qg6
ren v8 qg7
sort uid
merge 1:1 uid using grades
drop _merge q1-q7
gen q_top5=(qg1+qg2+qg3+qg4+qg5)/5

*have a nice day

Entry filed under: Uncategorized. Tags: , .

Soc of mass media, week 10 Fiddler’s Green


  • 1. brubineau  |  March 12, 2010 at 9:56 am

    keeping it all in excel (untested & perhaps inelegant):


    i enjoy learning from your code.

  • 2. brubineau  |  March 12, 2010 at 9:58 am

    oops, last line should be:


  • 3. fojtasek  |  March 12, 2010 at 2:04 pm

    I would have done this in R — something like this:

    sorted<-apply(dat,1,sort) # apply sort to each row
    sorted<-aperm(sorted) # undoes transposition caused by sort
    clean<-sorted[,3:max(col(sorted))] # drop first two columns

  • 4. gabrielrossman  |  March 12, 2010 at 2:22 pm

    thanks to both of you guys for the code. i’m not surprised that it’s doable in both Excel and R but i wouldn’t have been able to figure out how without a lot of head-scratching. the mega-recursive excel function is terrifying looking but probably the best bet for most people since it keeps everything in the program they probably use for record-keeping. anyway, thanks to both of you.

  • 5. John-Paul Ferguson  |  March 12, 2010 at 7:31 pm

    Am I missing something? If you want to keep their best five of seven quiz scores, why not use -reshape- to put the data into a long format, then sort on “student_id” and “quizscore”, keep by student_id if n<=5 and then total quizscore by student_id?

    • 6. gabrielrossman  |  March 12, 2010 at 8:07 pm

      i don’t know why it didn’t occur to me to reshape — i was kind of fixated on the idea of keeping the data wide and doing the kind of thing i eventually did in perl, brian did in Excel, and geoffrey did in R. i guess it was a framing thing, that i was thinking of quizzes as xsec traits of students rather than as panel data.

    • 7. MHollister  |  March 13, 2010 at 6:02 am

      I was going to suggest the same thing. One note, though: if you use the regular sort and you want to drop the lowest two then once you’re in long form you want to

      drop if _n<=2

      also, if there are missing values you'll want to replace them with zeros before sorting.

  • 8. John-Paul Ferguson  |  March 12, 2010 at 7:33 pm

    FYI, Brian: At least through Office 2003, Excel only allowed seven levels of nested IF statements. So your example works but it wouldn’t if one assigned any more quizzes! I think they’ve changed this.

  • 9. Dulani  |  March 22, 2010 at 12:37 pm

    I have another way I’d do this in Excel that doesn’t involve “if” statements, scripting, or sorting. Assuming your 7 quizzes are in columns A through G, you could put this formula in column H, fill down and voila, the top 5 scores have been averaged!
    Excel formula:

  • 10. Dulani  |  March 22, 2010 at 12:39 pm

    Looks like the blog format cut off my Excel formula. Here’s the whole thing with carriage returns inserted.


  • 11. Keep the best 5 (updated) « Code and Culture  |  November 24, 2010 at 4:26 am

    […] year I mentioned my policy of assigning about seven quizzes and then keeping the best 5. I then had a real Rube Goldberg-esque workflow that involved piping to Perl. Several people came […]

The Culture Geeks

%d bloggers like this: