Gross.pl
March 31, 2010 at 3:40 pm GR 4 comments
| Gabriel |
A few months ago I talked about reshaping field-tagged data and gave some clumsy advice for doing so. I’ve now written a Perl script that does this more elegantly. It’s written to extract the movie title (“MV”) and domestic box office gross (“GR”) from the IMDb file business.list, but you could adapt it to extract other variables and/or work on other field-tagged data.
Basically, the script will turn this:
-------------------------------------------------------------------------------
MV: Little Shop of Horrors (1986)
AD: 118,418 (Sweden)
BT: USD 30,000,000
GR: USD 34,656,704 (USA) (8 February 1987)
GR: USD 33,126,503 (USA) (1 February 1987)
GR: USD 30,810,276 (USA) (25 January 1987)
GR: USD 27,781,027 (USA) (18 January 1987)
GR: USD 23,727,232 (USA) (11 January 1987)
GR: USD 19,546,049 (USA) (4 January 1987)
GR: USD 11,412,248 (USA) (28 December 1986)
GR: USD 3,659,884 (USA) (21 December 1986)
GR: USD 38,747,385 (USA)
GR: SEK 4,318,255 (Sweden)
OW: USD 3,659,884 (USA) (21 December 1986) (866 screens)
RT: USD 19,300,000 (USA)
SD: 21 October 1985 - ?
WG: USD 1,112,016 (USA) (8 February 1987) (871 screens)
WG: USD 1,719,329 (USA) (1 February 1987)
WG: USD 2,093,847 (USA) (25 January 1987)
WG: USD 3,222,066 (USA) (18 January 1987)
WG: USD 3,057,666 (USA) (11 January 1987) (858 screens)
WG: USD 4,004,838 (USA) (4 January 1987) (866 screens)
WG: USD 5,042,682 (USA) (28 December 1986) (866 screens)
WG: USD 3,659,884 (USA) (21 December 1986) (866 screens)
-------------------------------------------------------------------------------
Into this:
Little Shop of Horrors (1986)	34,656,704 (USA) (8 February 1987)
Little Shop of Horrors (1986)	33,126,503 (USA) (1 February 1987)
Little Shop of Horrors (1986)	30,810,276 (USA) (25 January 1987)
Little Shop of Horrors (1986)	27,781,027 (USA) (18 January 1987)
Little Shop of Horrors (1986)	23,727,232 (USA) (11 January 1987)
Little Shop of Horrors (1986)	19,546,049 (USA) (4 January 1987)
Little Shop of Horrors (1986)	11,412,248 (USA) (28 December 1986)
Little Shop of Horrors (1986)	3,659,884 (USA) (21 December 1986)
Little Shop of Horrors (1986)	38,747,385 (USA)
Here’s the code:
#!/usr/bin/perl
#gross.pl by ghr
#this script cleans the IMDB file business.list
#raw data is field-tagged, key tags are "MV" (movie title) and "GR" (gross)
#record can have multiple "gross" fields, only interested in those with "(USA)"
#ex
#MV: Astronaut's Wife, The (1999)
#GR: USD 10,654,581 (USA) (7 November 1999)
#find "MV" tag, keep in memory, go to "GR" tag and write out as "MV\tGR"

use warnings;
use strict;

die "usage: gross.pl <IMDB business file>\n" unless @ARGV==1;
my $rawdata = shift(@ARGV);

# if line=MV, redefine the "title" variable
# if line=GR, write out with "title" in front
#optional, screen out non "USA" gross, parse GR into
#"currency, quantity, country, date"
my $title ;
my $gross ;

#three-argument open with lexical filehandles, so odd characters in the
#file name cannot be misread as an open mode
open(my $in, '<', $rawdata) or die "error opening $rawdata for reading: $!\n";
open(my $out, '>', 'gross.txt') or die "error creating gross.txt: $!\n";
print $out "title\tgross\n";
while (<$in>) {
    #match "MV" lines by looking for lines beginning "MV: "
    if ($_ =~ /^MV: /) {
        $title = $_;
        $title =~ s/\015?\012//; #manual chomp
        $title =~ s/^MV: //; #drop leading tag
        print "$title "; #progress report on screen
    }
    #match USD "GR" lines tagged "(USA)", write out with title in front
    if ($_ =~ m/^GR: USD .+\(USA\)/) {
        $gross = $_;
        $gross =~ s/\015?\012//; #manual chomp
        $gross =~ s/^GR: USD //; #drop leading tag
        print $out "$title\t$gross\n";
    }
}
close $in;
close $out;
print "\ndone\n";
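The comments in the script mention optionally parsing the GR field into currency, quantity, country, and date. A minimal sketch of that step might look like the following. The parse_gross subroutine and its regex are hypothetical and not part of gross.pl. It assumes each GR line has the same form as the lines in the example above, and it needs Perl 5.10 or later for the // operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not part of gross.pl: split a "GR" line into
# currency, amount, country, and date. Returns an empty list when the
# line does not look like a GR record. Country and date come back as
# empty strings when the line omits them.
sub parse_gross {
    my ($line) = @_;
    $line =~ s/\015?\012\z//;    # strip a trailing LF or CRLF
    return unless $line =~ /^GR:\s+([A-Z]{3})\s+([\d,]+)(?:\s+\(([^)]+)\))?(?:\s+\(([^)]+)\))?/;
    my ($currency, $amount, $country, $date) = ($1, $2, $3, $4);
    $amount =~ tr/,//d;          # drop thousands separators
    return ($currency, $amount, $country // '', $date // '');
}

# Sample lines copied from the Little Shop of Horrors record above
my @samples = (
    "GR: USD 34,656,704 (USA) (8 February 1987)\n",
    "GR: USD 38,747,385 (USA)\n",
    "GR: SEK 4,318,255 (Sweden)\n",
);
for my $line (@samples) {
    my @fields = parse_gross($line);
    print join("\t", @fields), "\n" if @fields;
}
```

Inside the while loop of gross.pl you could call something like this on each GR line and print the four fields as separate tab-delimited columns, which saves splitting the gross string later in your stats package.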
Entry filed under: Uncategorized. Tags: cleaning, IMDB, perl.
1. Noah | April 1, 2010 at 1:45 pm
You wrote code that didn’t end with have a nice day? What am I supposed to think now?
2. wos2tab.pl « Code and Culture | July 19, 2010 at 2:15 pm
[…] than that). To get the data into tab-delimited text, I modified an earlier script I wrote for parsing field-tagged IMDb files (in my case business.list but most of the film-level IMDb files are structured similarly). The […]
3. imdb_personnel.pl « Code and Culture | July 26, 2010 at 4:21 am
[…] remarked, IMDb files have a weird structure that ain’t exactly ready to rock. I already posted a file for dealing with business.list (which could also be modified to work with files like certificates.list). The personnel files […]
4. Getting long flat-files out of field-tagged data « Code and Culture | August 12, 2010 at 4:22 am
[…] actually already done this kind of thing twice, with my code for cleaning Memetracker and the IMDb business file. However those two datasets had the convenient property that the record key appears in the first […]