Oscar Appeal

July 29, 2013 at 8:06 am 5 comments

| Gabriel |

This post contains two Stata do-files for constructing the “Oscar appeal” variable at the center of Rossman & Schilke “Close But No Cigar.”

The first file takes IMDb files (cleaned and imported into Excel), reads them into Stata, and does various cleaning, most notably the calculation of lambda and tau.

*mergeorama.do
*Stata do-file for Rossman & Schilke "Close But No Cigar"
*Please direct any questions to rossman@soc.ucla.edu

*This do-file imports raw data and precodes it, including coding lambda_it and tau_j

*You can find the raw IMDb data at http://www.imdb.com/interfaces
*This do-file assumes that you have pre-cleaned the IMDb files and stored them XLSX format
*Note that IMDb files cannot be imported directly into Stata but require pre-cleaning for several reasons
* 1. Many IMDb files have a field-tagged data structure whereas Stata expects rectangular (rows are cases, columns are variables) data. Even the rectangular artist-level datasets are odd, having a different format for the first row describing an artist than for subsequent rows
*    For Perl scripts useful for changing the data structure of IMDb files, please see
*    http://codeandculture.wordpress.com/2010/07/26/imdb_personnel-pl/
*    http://codeandculture.wordpress.com/2010/03/31/gross-pl/
* 2. The IMDb files are extremely large and so for memory purposes it is desirable to drop as much data as possible using one-line-at-a-time approaches like Perl or Bash before using software like Stata or R that reads a file completely into memory
*This one-liner drops most of the tv episodes, direct-to-video, video games, and pre-1970 films. It should be run *after* the step 1 scripts.
*perl -ne '!/.+\) {.+/ && !/.+[0-9]\) \(T?VG?\)/ && !/.+\(19[0-6][0-9]\).*/ && print' foo.txt > bar.txt
* 3. Stata's insheet command chokes on certain characters. (The usual error is to skip several hundred or thousand rows of data). Rather than identifying and eliminating all of these special characters, we find it easiest to pass the data through Excel as our tests show that this works reliably with no degradation in data quality or quantity.
*The IMDb license prevents us from recirculating our copy of the data. Please contact rossman@soc.ucla.edu for more advice on pre-cleaning the data

*Our file structure is unfortunately somewhat complicated but we always refer to paths through globals. There are no hard-coded paths so changing these globals to fit your own file system should solve everything
global parentpath "~/Dropbox/oscarsperformativity/studiosystem"
global oscarsfolder "~/Documents/oscars"
global bigstuff "$oscarsfolder/studiosystemoverflow"
global recent_dir "$oscarsfolder/dataapril2010/maria"
global recent_eligibles "$recent_dir/filmtitles_Apr2010.txt"
global recent_oscars "$recent_dir/oscars20062010.txt"
global tables "$parentpath/tables"
global rawimdbfiles "$oscarsfolder/dataapril2010"

*These switches turn batches of code on and off. This both facilitates code-folding and debugging. For final run though make sure everything is turned on
*switches
local keywords	1
local genres	1
local mpaa	1
local pnomimport	1
local distributors	1
local releasedates	1

*options
local minkwfreq	10 /*how many times must keyword occur throughout period to count*/
local window 5 /*how long is the time window for calculating y-hat, tau, etc */
local startdate 1985 /*when does the _analysis_ start ? (lagged covariates start earlier)*/
local enddate 2009
local prestart=`startdate'-`window' /*covariates like tau are lagged and so start before the analysis*/

capture log close
log using $parentpath/mergeorama.log, replace

cd $bigstuff

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
*distributors
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
if `distributors'==1 {
	import excel "$oscarsfolder/dataapril2010/distributors_1970andup_noTV.xlsx", sheet("distributors_1970andup_noTV.txt") clear
	ren A imdbtitle_raw
	ren B distributor
	ren C release

	gen nation=regexs(1) if regexm(distributor,"\[(.+)\]")
	keep if nation=="" | regexm(nation,"us")==1
	gen releasedate=real(regexs(1)) if regexm(release,"^\(([1-2][0-9][0-9][0-9])")
	gen filmdate=real(regexs(1)) if regexm(imdbtitle_raw,"\(([1-2][0-9][0-9][0-9])")
	drop if filmdate<1950
	drop if releasedate> filmdate+5 & releasedate!=.
	*eliminate release dates that clearly don't reflect the original theatrical release
	drop if regexm(release,"\(re\-release\)")
	drop if regexm(release,"\(laserdisc\)")
	drop if regexm(release,"\(video\)")
	drop if regexm(release,"\(laserdisc\)")
	drop if regexm(release,"\(non-USA\)")
	*among ambiguous cases choose the one that most closely reflects the US theatrical release
	gen priority=99
	replace priority=1 if regexm(release,"\(USA\) \(theatrical\)")
	replace priority=1.5 if regexm(release,"\(North America\) \(theatrical\)")
	replace priority=2 if regexm(release,"\(USA\) \(all media\)")
	replace priority=2.5 if regexm(release,"\(North America\) \(all media\)")
	replace priority=3 if regexm(release,"\(worldwide\) \(all media\)")
	replace priority=3 if regexm(release,"\(worldwide\) \(theatrical\)")
	replace priority=10 if regexm(release,"^\([12][0-9]+\) \(USA\)") & priority==99
	replace priority=10.5 if regexm(release,"^\([12][0-9]+\) \(North America\)") & priority==99

	replace distributor=subinstr(distributor," [us]","",1)

	sort imdbtitle_raw priority
	*debug
	list if regexm(imdbtitle_raw ,"Girl in a")
	list if regexm(imdbtitle_raw ,"Pulp Fiction")
	by imdbtitle_raw : gen rownum=[_n]
	replace rownum=rownum[_n-1] if imdbtitle_raw==imdbtitle_raw[_n-1] & priority==priority[_n-1]
	replace rownum=rownum[_n-2] if imdbtitle_raw==imdbtitle_raw[_n-2] & priority==priority[_n-2]
	*debug
	list if regexm(imdbtitle_raw ,"Girl in a")
	list if regexm(imdbtitle_raw ,"Pulp Fiction")
	keep if rownum==1
	drop rownum priority

	merge m:1 imdbtitle_raw using $parentpath/rawmerge2
	keep if _merge==3
	drop imdbtitle_raw imdbtitle_merge _merge releasedate nation filmdate
	order imdbtitle year
	compress
	gen major=0 /*core division of major studios*/
	gen indymajor=0 */ "independent film" subsidiaries of major studios*/
	*note, true independents are ommitted category
	foreach major in "Columbia Pictures" "Columbia TriStar Film Distributors International" "Metro-Goldwyn-Mayer (MGM)" "Orion Pictures Corporation" "Paramount Pictures" "Sony Pictures Entertainment (SPE)" "TriStar Pictures" "Twentieth Century Fox Film Corporation" "United Artists" "Universal Pictures" "Warner Bros. Pictures" "Warner Brothers" {
		replace major=1 if distributor=="`major'"
	}
	replace major=1 if regexm(distributor,"DreamWorks") /*regexm captures several variations*/
	foreach indymajor in "Fox Searchlight Pictures" "Focus Features" "Sony Pictures Classics" "Warner Independent Pictures (WIP)" { /*"Focus" is part of Universal, in other cases relationship is obvious*/
		replace indymajor=1 if distributor=="`indymajor'"
	}
	*for indymajor's that were acquired and/or divested during period, make coding time-varying
	replace indymajor=1 if distributor=="Dimension Films" & year>=1993 & year<2005 /* horror film division of Miramax, acquired by Disney as part of Miramax, but divested when the Weinstein's split to form "Weinstein Company" */
	replace indymajor=1 if distributor=="Miramax Films" & year>=1993  /* Disney radically restructured/sold off in late 2010 */
	replace indymajor=1 if distributor=="Samuel Goldwyn Company, The" & year>=1997
	replace indymajor=1 if distributor=="New Line Cinema" & year>=1996 /*TW / Turner merger*/
	replace indymajor=1 if distributor=="Fine Line Features" & year>=1996 /*TW / Turner merger*/

	collapse (max) major indymajor, by (imdbtitle year)
	replace major=0 if indymajor==1

	*debug
	list if regexm(imdbtitle,"Girl in a")
	list if regexm(imdbtitle,"Pulp Fiction")
	list if regexm(imdbtitle,"Left Foot")
	*note, "Girl" should not be "indymajor" but Pulp Fiction should

	sum *major*

	save $bigstuff/distributors.dta, replace
}

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
*keywords
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
if `keywords'==1 {
	*just imports, lambda and tau are later
	import excel "$rawimdbfiles/keywords_1970andup_noTVG_a.xlsx", sheet("keywords_1970andup_noTVG_a.txt") clear
	save $bigstuff/keywords.dta, replace
	import excel "$rawimdbfiles/keywords_1970andup_noTVG_b.xlsx", sheet("keywords_1970andup_noTVG_b.txt") clear
	append using $bigstuff/keywords.dta
	ren A imdbtitle_raw
	ren B keyword

	drop if regexm(imdbtitle,"\) \(T?VG?\)")
	drop if regexm(keyword,"oscar-nom")
	drop if regexm(keyword,"oscar-win")
	drop if regexm(keyword,"academy-award")
	drop if regexm(keyword,"critically-acclaimed")
	sort keyword
	by keyword: drop if [_N]<`minkwfreq' /*eliminate extremely rare keywords*/
	compress
	sort imdbtitle
	merge m:1 imdbtitle_raw using $parentpath/rawmerge2
	keep if _merge==3
	drop _merge imdbtitle_raw imdbtitle_merge
	sort keyword
	preserve
	contract keyword
	sum _freq, detail
	local maxkwfreq=`r(p99)'
	sort _freq
	list if _freq>=`maxkwfreq'
	restore
	by keyword: drop if [_N]<`minkwfreq'
	by keyword: drop if [_N]>`maxkwfreq'
	sort imdbtitle
	compress
	save $bigstuff/keywords.dta, replace
	contract keyword
	list if _freq>50
	save $bigstuff/keywords_freqs, replace
	outsheet using $bigstuff/keywords_freqs.txt, replace
}

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
*genres
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
if `genres'==1{
	*just imports, lambda and tau are later
	import excel "$rawimdbfiles/genres_1970andup_noTVG.xlsx", sheet("genres_1970andup_noTVG.txt") clear
	ren A imdbtitle_raw
	ren B genre
	merge m:1 imdbtitle_raw using $parentpath/rawmerge2
	keep if _merge==3
	drop _merge imdbtitle_raw imdbtitle_merge
	drop if genre=="Adult"
	drop if genre=="Reality-TV"
	drop if genre=="Short"
	drop if regexm(imdbtitle,"\(T?VG?\)$")
	replace genre=lower(genre)
	replace genre=regexs(1)+regexs(2) if regexm(genre,"(.+)-(.+)")
	save $bigstuff/genre_long, replace

	foreach g in action adventure biography comedy crime drama family fantasy filmnoir history horror music musical mystery romance scifi sport thriller war western {
		gen g_`g'=0
		replace g_`g'=1 if genre=="`g'"
	}
	sort imdbtitle
	gen rows=1
	collapse (sum) rows (max) g_action g_adventure g_biography g_comedy g_crime g_drama g_family g_fantasy g_filmnoir g_history g_horror g_music g_musical g_mystery g_romance g_scifi g_sport g_thriller g_war g_western, by (imdbtitle year)
	sort imdbtitle
	save $bigstuff/genre.dta, replace
}

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
*mpaa
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
if `mpaa'==1{
	import excel "$parentpath/film_ratings_short.xlsx", sheet("film_ratings_short.csv") firstrow clear
	*for documentation: the additional data came from filmratings.com, inbaseline.com, and movies.yahoo.com
	keep imdbtitle mpaa
	ren imdbtitle imdbtitle_raw
	ren mpaa rating
	save $bigstuff/certificates_handcoded.dta, replace

	insheet using $oscarsfolder/dataapril2010/certificates.txt, tab clear
	ren v1 imdbtitle_raw
	ren v2 rating
	ren v3 notes
	append using $bigstuff/certificates_handcoded.dta
	merge m:1 imdbtitle_raw using $parentpath/rawmerge2
	list if regexm(imdbtitle,"Imaginarium of Doctor Parnassus") /*debug*/
	drop _merge imdbtitle_raw imdbtitle_merge
	gen priority=0
	replace notes=regexs(1)+" "+regexs(2) if regexm(notes,"(.+)'(.+)")   /*similar to "Rosemary's Baby" but for "director's cut*/
	replace priority=-1 if regexm(notes,"^\(re-rating\) \([0-9][0-9][0-9][0-9]\)")
	replace priority=-1 if regexm(notes,"^\(DVD.*\)")
	replace priority=-1 if regexm(notes,"^\(VHS.*\)")
	replace priority=-1 if regexm(notes,"^\(video.*\)")
	replace priority=-1 if regexm(notes,"^\(TV rating\)")
	replace priority=-1 if regexm(notes,"^\(director s cut\)")
	gen mpaa=0
	replace mpaa=1 if rating=="USA:G" | rating=="G"
	replace mpaa=2 if rating=="USA:PG" | rating=="PG"
	replace mpaa=3 if rating=="USA:PG-13" | rating=="PG13"
	replace mpaa=4 if rating=="USA:R" | rating=="R"
	replace mpaa=5 if rating=="USA:NC-17" | rating=="NC17"
	replace mpaa=6 if rating=="USA:X" | rating=="X"
	replace mpaa=7 if rating=="USA:Unrated" | rating=="USA:Not Rated" | rating=="NR"
	recode mpaa 6/7=5 /*fold X & unrated into NC17*/
	capture lab drop mpaa
	lab def mpaa 1 "G" 2 "PG" 3 "PG13" 4 "R" 5 "else"
	notes mpaa: category 5 includes X, NC17, and unrated
	list if regexm(imdbtitle,"Imaginarium of Doctor Parnassus") /*debug*/
	drop if mpaa==0 | priority==-1
	collapse (min) mpaa, by (imdbtitle year)
	compress
	lab val mpaa mpaa
	sort imdbtitle
	list if regexm(imdbtitle,"Imaginarium of Doctor Parnassus") /*debug*/
	save $bigstuff/mpaa.dta, replace
}

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
*Oscar noms + wins
*Past noms variables
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
if `pnomimport'==1 {
	*Oscars data isn't in the FTP version of IMDb, coded from Oscars.org
	*this batch of code:
	* 1. combines recent Oscars data w older data from Rossman, Esparza, and Bonacich 2010 data
	* 2. recodes the data to create noms to-date for workers, merge them onto films.
	insheet using $recent_oscars, clear
	replace filmtitle=regexs(2)+", "+regexs(1) if regexm(filmtitle,"(A) (.+)")
	gen imdbtitle=filmtitle+" ("+filmyear+")"
	replace imdbtitle="Squid and the Whale, The (2005)" if imdbtitle=="The Squid and the Whale (2005)"
	replace imdbtitle="Crash (2004)" if imdbtitle=="Crash (2004/I)"
	replace imdbtitle="I'm Not There. (2007)" if imdbtitle=="I'm Not There (2007)"
	replace imdbtitle="Mrs. Henderson Presents (2005)" if imdbtitle=="Mrs Henderson Presents (2005)"
	replace imdbtitle="A Serious Man (2009)" if imdbtitle=="Serious Man, A (2009)"
	replace imdbtitle="Constant Gardener, The (2005)" if imdbtitle=="The Constant Gardener (2005)"
	replace imdbtitle="A Single Man (2009)" if imdbtitle=="Single Man, A (2009)"
	save $bigstuff/recent_oscars.dta, replace
	*nominee var means redundant at j level
	drop nominee
	contract _all
	drop _freq filmtitle filmyear
	save $bigstuff/recent_oscars_jcat.dta, replace
	drop category
	gen noms_j=1
	ren winner wins_j
	collapse (sum) noms_j wins_j, by (imdbtitle)
	sort imdbtitle
	save $bigstuff/recent_oscars_j.dta, replace

	*load each profession file, then "keep if" and append

	*best picture
	use $bigstuff/recent_oscars.dta, clear
	keep if category=="Best Picture"
	drop category
	append using "$oscarsfolder/bestpicture.dta"
	replace year=real(regexs(1)) if year==. & regexm(filmyear,"([0-9]+)")==1
	replace bp_nom=1 if winner!=.
	replace bp_oscar=1 if winner==1
	recode bp_oscar .=0
	replace imdbtitle=film if imdbtitle==""
	drop oscaryear filmtitle filmyear film winner
	compress
	replace year=1985 if imdbtitle=="Room with a View, A (1985)"
	save $bigstuff/bp.dta, replace
	drop nominee
	contract _all
	drop _freq
	save $bigstuff/bp_j.dta, replace

	*directors
	use $bigstuff/recent_oscars.dta, clear
	keep if category=="Best Director"
	drop category
	ren nominee director
	ren imdbtitle film
	append using "$oscarsfolder/directors/directors_gabey.dta"
	ren film imdbtitle
	replace Fdirector_nom=1 if winner!=.
	replace Fdirector_oscar=1 if winner==1
	recode Fdirector_oscar .=0
	replace year=real(regexs(1)) if year==. & regexm(filmyear,"([0-9]+)")==1
	gen altyear=real(regexs(1)) if regexm(imdbtitle,"[^\(]+\(([0-9][0-9][0-9][0-9])")
	drop if altyear!=year
	drop winner filmtitle filmyear oscaryear altyear
	save $bigstuff/directors.dta, replace
	drop if Fdirector_nom==0
	contract imdbtitle Fdirector_nom Fdirector_oscar year
	drop _freq
	drop if Fdirector_oscar==0 & imdbtitle=="Brokeback Mountain (2005)"
	sort imdbtitle
	save $bigstuff/directors_j.dta, replace

	*writers
	use $bigstuff/recent_oscars.dta, clear
	keep if category=="Best Writing Written Directly for Screen" | category=="Best Writing Previously Produced or Published"
	drop category
	ren nominee writer
	save $bigstuff/writers.dta, replace
	insheet using $oscarsfolder/writers/woscarnoms.txt, tab clear
	keep if confirm==1
	drop confirm
	tempfile writers
	save `writers', replace
	ren film imdbtitle
	append using $bigstuff/writers.dta /* 2005-2010 */
	gen Fwriter_nom=1
	replace fwriter_oscar=winner if fwriter_oscar==.
	drop winner
	replace year=real(regexs(1)) if year==. & regexm(imdbtitle,"[\()]([0-9]+)")==1
	drop oscaryear filmtitle filmyear
	sort imdbtitle
	save $bigstuff/writers.dta, replace
	contract fwriter_oscar Fwriter_nom imdbtitle year
	drop _freq
	sort imdbtitle
	save $bigstuff/writers_j.dta, replace

	*actors
	use $bigstuff/recent_oscars.dta, clear
	keep if category=="Best Actor Leading Role" | category=="Best Actor Supporting Role" | category=="Best Actress Leading Role" | category=="Best Actress Supporting Role"
	gen year=real(regexs(1)) if regexm(filmyear,"([0-9][0-9][0-9][0-9])")
	ren nominee name
	drop filmyear filmtitle oscaryear
	save $bigstuff/actors.dta, replace
	use "$oscarsfolder/oscar_noms4.18.06.dta", clear
	ren film imdbtitle
	decode category, gen(category_s)
	drop compid category
	ren category_s category
	ren actor_oscar winner
	append using $bigstuff/actors
	replace category="Best Actor" if category=="Best Actor Leading Role"
	replace category="Best Actress" if category=="Best Actress Leading Role"
	replace category="Supporting Actor" if category=="Best Actor Supporting Role"
	replace category="Supporting Actress" if category=="Best Actress Supporting Role"
	recode actor_nom .=1
	contract _all
	drop _freq
	save $bigstuff/actors.dta, replace
	collapse (sum) winner actor_nom, by(imdbtitle year)
	ren winner actor_wins
	ren actor_nom actor_noms
	sort imdbtitle
	save $bigstuff/actors_j.dta, replace

	*attach all noms / wins
	use imdb19752010_mk, clear
	sort imdbtitle year
	merge 1:1 imdbtitle year using $bigstuff/actors_j
	tab _merge if year<`prestart'
	drop _merge
	sort imdbtitle year
	merge 1:1 imdbtitle year using $bigstuff/bp_j
	tab _merge if year<`prestart'
	drop _merge
	sort imdbtitle year
	merge 1:1 imdbtitle year using $bigstuff/directors_j
	tab _merge if year<`prestart'
	drop _merge
	sort imdbtitle year
	merge 1:1 imdbtitle year using $bigstuff/writers_j
	tab _merge if year<`prestart'
	drop _merge
	foreach var in actor_wins actor_noms bp_oscar bp_nom Fdirector_nom Fdirector_oscar fwriter_oscar Fwriter_nom {
		recode `var' .=0
	}
	sum
	desc
	save $bigstuff/oscarsnomswins, replace

	lab var eligible "dummy AMPAS nom ballot"
	lab var sstitle "title in StudioSystem"
	lab var actor_wins "Count of acting wins, across sub-cats"
	lab var actor_noms "Count of acting noms, across sub-cats"
	lab var bp_oscar "Best Picture win"
	lab var bp_nom "Best Picture nom"
	lab var Fdirector_nom "Director nom"
	lab var Fdirector_oscar "Director win"
	lab var fwriter_oscar "Writing win"
	lab var Fwriter_nom  "Writing nom"

	gen totalwins=actor_wins+bp_oscar+Fdirector_oscar+fwriter_oscar
	gen totalnoms=actor_noms+bp_nom+Fdirector_nom+Fwriter_nom    /*notation: kappa_j */
	gen pl_totwins=ln(totalwins+(totalwins^2+1)^0.5)
	gen pl_totnoms=ln(totalnoms+(totalnoms^2+1)^0.5)
	lab var pl_totwins "pseudo-log(total wins)" /*"pl" or "pseudo-log" is our shorthand for inverse hyperbolic sine*/
	lab var pl_totnoms "pseudo-log(total noms)"
	save imdb19752010_oscars, replace
	drop if eligible==0 | eligible==.a
	desc
	compress imdbtitle
	save $bigstuff/imdb19752010_eligible_oscars, replace

	*=-=-=-=-=-=-=-=-=-=-=-=-
	*directors
	*=-=-=-=-=-=-=-=-=-=-=-=-
	*create running tally to date of oscar noms
	*create annual files
	forvalues t=1927/`enddate' {
	use $bigstuff/directors.dta, clear
		keep if year==`t' & Fdirector_nom==1
		compress
		sort director
		gen noms=1
		collapse (sum) nom , by (director)
		sort director
		compress
		save $bigstuff/annual/director_`t'.dta, replace
	}
	*create to-date director CVs
	clear
	set obs 1
	gen director=""
	gen directornomsum=.
	gen year=.
	lab var directornomsum "n nominations to date"
	sort director
	forvalues t=1927/`enddate' {
		merge director using $bigstuff/annual/director_`t'.dta
		drop _merge
		recode noms .=0
		recode directornomsum .=0
		replace directornomsum=directornomsum+noms
		drop noms
		replace year=`t'+1
		lab var year "year +1 (for lag)"
		sort director
		save $bigstuff/annual/directorsum_`t'.dta, replace
	}
	clear
	*create single director-year file
	forvalues t=1927/`enddate' {
		append using $bigstuff/annual/directorsum_`t'.dta
	}
	drop if director==""
	sort director year
	save $bigstuff/directorsum_allt.dta, replace

	*get raw data
	import excel "$rawimdbfiles/directors_1970andup_noTVG.xlsx", sheet("directors_1970andup_noTVG.txt") firstrow clear
	drop C /*notes about what the person contributed or under what alias*/
	ren Titles imdbtitle_raw
	ren Name director
	merge m:1 imdbtitle_raw using $parentpath/rawmerge2
	drop if _merge==1
	*note, not "keep if _merge==3"
	drop _merge imdbtitle_raw
	save $bigstuff/directorcredits, replace
	*combine with the director-film file
	*note that the merge is lagged one year by merging the current t onto t+1
	use $bigstuff/directorsum_allt.dta, clear
	merge 1:m director year using $bigstuff/directorcredits
	drop if imdbtitle==""
	drop imdbtitle_merge _merge
	recode directornomsum .=0
	collapse (sum) directornomsum, by ( year imdbtitle)
	lab var directornomsum "n nominations on director's cv by t"
	order imdbtitle year directornomsum
	compress
	sort imdbtitle
	save $bigstuff/pastnoms_d.dta, replace

	*=-=-=-=-=-=-=-=-=-=-=-=-
	*writers
	*=-=-=-=-=-=-=-=-=-=-=-=-
	*create running tally to date of oscar noms
	*create annual files
	forvalues t=1927/`enddate' {
		use $bigstuff/writers.dta, clear
		keep if year==`t' & Fwriter_nom==1
		compress
		sort writer
		gen noms=1
		collapse (sum) nom , by (writer)
		sort writer
		compress
		save $bigstuff/annual/writer_`t'.dta, replace
	}
	*create to-date writer CVs
	clear
	set obs 1
	gen writer=""
	gen writernomsum=.
	gen year=.
	lab var writernomsum "n nominations to date"
	sort writer
	forvalues t=1927/`enddate' {
		merge writer using $bigstuff/annual/writer_`t'.dta
		drop _merge
		recode noms .=0
		recode writernomsum .=0
		replace writernomsum=writernomsum+noms
		drop noms
		replace year=`t'+1
		lab var year "year +1 (for lag)"
		sort writer
		save $bigstuff/annual/writersum_`t'.dta, replace
	}
	clear
	*create single writer-year file
	forvalues t=1927/`enddate' {
		append using $bigstuff/annual/writersum_`t'.dta
	}
	drop if writer==""
	sort writer year
	save $bigstuff/writersum_allt.dta, replace

	*get raw data
	import excel "$rawimdbfiles/writers_1970andup_noTVG.xlsx", sheet("writers_1970andup_noTVG.txt") firstrow clear
	ren Titles imdbtitle_raw
	ren C credit
	ren Name writer
	*drop innappropriate credits
	gen suspect=0
	gen screenplay=0
	replace screenplay=1 if regexm(credit,"screenplay")
	replace suspect=1 if regexm(credit,"episode") & regexm(imdbtitle_raw,"^_")
	replace suspect=1 if regexm(credit,"\(staff writer") & regexm(imdbtitle_raw,"^_")
	replace suspect=1 if regexm(credit,"^\(play")
	replace suspect=1 if credit=="(play)"
	replace suspect=1 if credit=="(novel)"
	replace suspect=1 if regexm(credit,"\(book")
	replace suspect=1 if regexm(credit,"\(novel")
	replace suspect=1 if regexm(credit,"\(short story")
	replace suspect=1 if regexm(credit,"\(short stories")
	gen year=real(regexs(2)) if regexm(imdbtitle_raw,"(.+) \(([0-9][0-9][0-9][0-9])")
	*drop if year>2006
	gen credityear=real(regexs(1)) if regexm(credit,"\(([0-9][0-9][0-9][0-9]) screenplay")
	drop if credit=="(earlier screenplay)"
	drop if year>credityear+3    /*dropped rather than "suspect" because "YYYY screenplay" is common credit for writer of the original on a remake */
	tab suspect screenplay
	drop if suspect==1 & screenplay==0

	if 1==1 {
		*drop "uncredited" and similar credits, but not "originally uncredited"
		gen uncredited=0
		replace uncredited=1 if regexm(credit,"\(uncredited\)")
		replace uncredited=1 if credit=="(uncredited)"
		replace uncredited=1 if credit=="(additional dialogue)"
		replace uncredited=1 if credit=="(characters)"
		replace uncredited=1 if credit=="(book)"
		replace uncredited=1 if credit=="(contributing writer) (uncredited)"
		replace uncredited=1 if credit=="(story) (uncredited)"
		replace uncredited=1 if credit=="(continuity)"
		replace uncredited=1 if credit=="(character)"
		replace uncredited=1 if credit=="(contributor to screenplay construction) (uncredited)"
		replace uncredited=1 if credit=="(additional dialogue) (uncredited)"
		replace uncredited=1 if credit=="(screenplay) (uncredited)"
		replace uncredited=1 if credit=="(contributor to treatment) (uncredited)"
		replace uncredited=1 if credit=="(short story)"
		replace uncredited=1 if credit=="(adaptation) (uncredited)"
		replace uncredited=1 if credit=="(novel) (uncredited)"
		replace uncredited=1 if credit=="(dialogue) (uncredited)"
		replace uncredited=1 if credit=="(contributor to dialogue) (uncredited)"
		replace uncredited=1 if credit=="(additional material)"
		replace uncredited=1 if credit=="(comic strip characters)"
		replace uncredited=1 if credit=="(play) (uncredited)"
		replace uncredited=1 if credit=="(writer) (uncredited)"
		replace uncredited=1 if credit=="(titles) (uncredited)"
		replace uncredited=1 if credit=="(characters) (uncredited)"
		replace uncredited=1 if credit=="(comic strip)"
		replace uncredited=1 if credit=="(article)"
		replace uncredited=1 if credit=="(musical play)"
		replace uncredited=1 if credit=="(dialogue continuity)"
		replace uncredited=1 if credit=="(story and continuity)"
		replace uncredited=1 if credit=="(television series)"
		replace uncredited=1 if credit=="(uncredited) (unconfirmed)"
		replace uncredited=1 if credit=="(additional story material)"
		replace uncredited=1 if credit=="(poem)"
		replace uncredited=1 if credit=="(comic book)"
		replace uncredited=1 if credit=="(Liberty Magazine serial chapter)"
		replace uncredited=1 if credit=="(treatment contributor) (uncredited)"
		replace uncredited=1 if credit=="(contributor to screenplay) (uncredited)"
		replace uncredited=1 if credit=="(screenplay contributor) (uncredited)"
		replace uncredited=1 if credit=="(based on a screenplay by)"
		replace uncredited=1 if credit=="(treatment) (uncredited)"
		replace uncredited=1 if credit=="(television series Star Trek)"
		replace uncredited=1 if credit=="(story) (unconfirmed)"
		replace uncredited=1 if credit=="(contributing writer) (uncredited) (unconfirmed)"
		replace uncredited=1 if credit=="(contributor) (uncredited)"
		replace uncredited=1 if credit=="(additional writer: story)"
		replace uncredited=1 if credit=="(book) (uncredited)"
		replace uncredited=1 if credit=="(television series Police Squad)"
		replace uncredited=1 if credit=="(screenplay construction contributor) (uncredited)"
		replace uncredited=1 if credit=="(Marvel comic book)"
		replace uncredited=1 if credit=="(character The Bride)"
		drop if uncredited==1
		drop uncredited
	}

	*drop great authors who did not directly participate in screen adaptation
	foreach author in "Shakespeare, William (I)" "Dickens, Charles" "Sophocles" "Twain, Mark" "Austen, Jane (I)" "Dostoyevsky, Fyodor" "Tolstoy, Leo" {
		drop if writer=="`author'"
	}
	drop suspect screenplay credityear  credit D E F

	merge m:1 imdbtitle_raw using $parentpath/rawmerge2
	drop if _merge==1
	drop _merge
	compress
	save $bigstuff/writercredits, replace

	use $bigstuff/writersum_allt.dta, clear
	merge 1:m writer year using $bigstuff/writercredits
	drop if imdbtitle==""
	drop imdbtitle_merge _merge
	recode writernomsum .=0
	collapse (sum) writernomsum, by ( year imdbtitle)
	lab var writernomsum "n nominations on writer's cv by t"
	order imdbtitle year writernomsum
	compress
	sort imdbtitle
	save $bigstuff/pastnoms_w.dta, replace

	*=-=-=-=-=-=-=-=-=-=-=-=-
	*actors
	*=-=-=-=-=-=-=-=-=-=-=-=-
	*create running tally to date of oscar noms
	*create annual files
	forvalues t=1927/`enddate' {
	use $bigstuff/actors.dta, clear
		keep if year==`t' & actor_nom==1
		compress
		sort name
		gen noms=1
		collapse (sum) nom , by (name)
		sort name
		compress
		save $bigstuff/annual/actor_`t'.dta, replace
	}
	*create to-date actor CVs
	clear
	set obs 1
	gen name=""
	gen actornomsum=.
	gen year=.
	lab var actornomsum "n nominations to date"
	sort name
	forvalues t=1927/`enddate' {
		merge name using $bigstuff/annual/actor_`t'.dta
		drop _merge
		recode noms .=0
		recode actornomsum .=0
		replace actornomsum=actornomsum+noms
		drop noms
		replace year=`t'+1
		lab var year "year +1 (for lag)"
		sort name
		save $bigstuff/annual/actorsum_`t'.dta, replace
	}
	clear
	*create single actor-year file
	forvalues t=1927/`enddate' {
		append using $bigstuff/annual/actorsum_`t'.dta
	}
	drop if name==""
	sort name year
	save $bigstuff/actorsum_allt.dta, replace

	*get raw data
	clear
	gen x=.
	save $bigstuff/actorcredits, replace
	foreach file in actors_1970andup_noTVG_a actors_1970andup_noTVG_b actresses_1970andup_noTVG {
		import excel "$rawimdbfiles/`file'.xlsx", sheet("`file'.txt") firstrow allstring clear
		ren Titles imdbtitle_raw
		ren C credit
		ren Name name
		keep name imdbtitle_raw
		merge m:1 imdbtitle_raw using $parentpath/rawmerge2
		keep if _merge==3
		append using $bigstuff/actorcredits
		compress
		save $bigstuff/actorcredits, replace
	}
	drop x imdbtitle_raw  _merge
	save $bigstuff/actorcredits, replace
	*combine with the actor-film file
	*note that the merge is lagged one year by merging the current t onto t+1
	use $bigstuff/actorsum_allt.dta, clear
	merge 1:m name year using $bigstuff/actorcredits
	drop if imdbtitle==""
	drop imdbtitle_merge _merge
	recode actornomsum .=0
	collapse (sum) actornomsum, by ( year imdbtitle)
	lab var actornomsum "n nominations on director's cv by t"
	order imdbtitle year actornomsum
	compress
	sort imdbtitle
	save $bigstuff/pastnoms_a.dta, replace

	*create the big file
	use $bigstuff/pastnoms_a.dta, clear
	merge 1:1 imdbtitle using $bigstuff/pastnoms_w
	drop _merge
	merge 1:1 imdbtitle using $bigstuff/pastnoms_d
	drop _merge

	gen pl_pnom_a=ln(actornomsum+(actornomsum^2+1)^0.5)
	gen pl_pnom_w=ln(writernomsum+(writernomsum^2+1)^0.5)
	gen pl_pnom_d=ln(directornomsum+(directornomsum^2+1)^0.5)

	foreach var in actornomsum writernomsum directornomsum {
		gen `var'_dummy=`var'
		recode `var'_dummy 0=0 1/1000=1
	}

	save $bigstuff/pastnoms, replace
}

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
*release dates
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
if `releasedates'==1 {
	import excel "$oscarsfolder/dataapril2010/releasedates_1970andup_noTVG.xlsx", sheet("releasedates_1970andup_noTVG.tx") clear
	ren A imdbtitle_raw
	ren B date_str
	ren C type
	gen d=date(subinstr(date_str,"USA:","",1),"DMY")
	head
	format d %td
	head
	codebook date_str
	keep if regexm(date_str,"^USA\:")
	sum d
	desc
	replace d=date(subinstr(date_str,"USA:","",1),"MY")+(runiform()*30) if d==.
	codebook imdbtitle_raw
	drop if regexm(type,"DVD premiere")
	gen festival=regexm(type,"[Ff]estival")
	tab festival , miss
	gen market=regexm(type,"[Mm]arket")
	tab market
	head
	merge m:1 imdbtitle_raw using $parentpath/rawmerge2.dta
	keep if _merge==3

	*from genre_dates_ratings
	gen goodrow=0
	foreach location in "(alternate version)" "(censored version)" "(theatrical release)" "(theatrical release) (limited)" "(silent version)" "(sound version)" "(sound version) (premiere)" "(premiere)" "(premiere) (Los Angeles, California)" "(limited)" "(complete verson)" "(Westwood, California)" "(Westwood, California) (premiere)" "(West Hollywood, California)" "(West Hollywood, California) (premiere)" "(Walnut Creek, Ca)" "(Venice, California)" "(USC premire screening)" "(USC)" "(USC) (premiere)" "(Thousand Oaks, California)" "(Thousand Oaks, California) (premiere)" "(San Fernando Valley, California)" "(South Pasadena, California) (premiere)" "(Santa Monica, California)" "(Santa Monica, California) (premiere)" "(Pasadena, California)" "(Pasadena, California) (premiere)" "(North Hollywood, California)" "(North Hollywood, California) (premiere)" "(Marina Del Rey, California)" "(Malibu, California)" "(Long Beach, California)" "(Long Beach, California) (premiere)" "(Los Angeles) (final cut)" "(Los Angeles) (premiere)" "(Los Angeles, California)" "(Los Angeles, California) (RKO roadshow version)" "(Los Angeles, California) (Redux version)" "(Los Angeles, California) (director's cut)" "(Los Angeles, California) (edited version) (premiere)" "(Los Angeles, California) (extended version)" "(Los Angeles, California) (final release version)" "(Los Angeles, California) (premeire)" "(Los Angeles, California) (premiere)" "(Los Angeles, California) (premiere) (Egyptian Theater, Hollywood)" "(La Canada, California) (premiere)" "(IMAX version)" "(IMAX version) (premiere)" "(IMAX version) (re-release)" "(Hollywood, California)" "(Hollywood, California) (premiere)" "(Hollywood, California) (re-release)" "(Hollywood, California) (silent version)" "(El Segundo, California)" "(Century City, California)" "(Century City, California) (premiere)" "(Burbank, California)" "(Burbank, California) (premiere)" "(Beverly Hills, California)" "(Beverly Hills, California) (premiere)" "(Academy of Television Arts and Sciences)" "(Academy of Television Arts and Sciences) (premiere)" "(Academy of Motion Picture Arts and Sciences)" "" {
		replace goodrow=1 if type=="`location'"
	}
	drop if d==.
	sort imdbtitle goodrow
	compress
	collapse (min) d, by(imdbtitle goodrow)
	replace d=. if goodrow==0 & imdbtitle==imdbtitle[_n+1] & goodrow[_n+1]==1
	replace d=14289 if imdbtitle=="Windhorse (1998)"
	drop if d==.
	drop goodrow
	generate dm=month(d)
	generate dy=doy(d)
	mkspline dy_1 180 dy_2 320 dy_3 = dy
	lab var dy_1 "day of year, 1-180"
	lab var dy_2 "day of year, 180-320"
	lab var dy_3 "day of year, 320-365"
	sort imdbtitle
	save $bigstuff/releasedates.dta, replace
}

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
*lamdba & tau
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
use $bigstuff/imdb19752010_eligible_oscars, clear
merge 1:m imdbtitle using $bigstuff/keywords.dta
keep if _merge==3
drop _merge
save $bigstuff/keywords_oscars, replace

sort imdbtitle
by imdbtitle: gen nkeywords=[_N]
gen pl_totwins_scaled=pl_totwins/(nkeywords^0.5)
gen pl_totnoms_scaled=pl_totnoms/(nkeywords^0.5)
gen weightedfreq=1/(nkeywords^0.5)
gen keywordfreq=1
collapse (sum) pl_totwins_scaled pl_totnoms_scaled weightedfreq keywordfreq, by(keyword)
lab var weightedfreq "keyword, down-weighted by how many other kw appear on same films"
* notation for keywordfreq: n_i(j)
gen keywordratio=pl_totnoms_scaled/keywordfreq
gen keywordratio_w=pl_totwins_scaled/keywordfreq
save keyword_weightedfreq.dta, replace
sort keywordratio
tail

drop pl_totwins_scaled pl_totnoms_scaled
save keyword_scores, replace  /*illustrative only, time-varying scores used for calculation*/

*create tau_j with lagged lambda_i
forvalues i=`prestart'/`enddate' {
	use $bigstuff/keywords_oscars, clear
	local past=`i'-`window'
	local i1=`i'-1
	keep if year>=`past' & year<`i'
	note: data from `past' - `i1', meant to serve as lagged for `i'
	compress
	save $bigstuff/annual/kwo_`i', replace

	use $bigstuff/keywords_oscars, clear
	keep if year==`i'
	compress
	save $bigstuff/annual/targetfile_`i', replace

	use $bigstuff/annual/kwo_`i', clear
	sum year
	sort imdbtitle
	by imdbtitle: gen nkeywords=[_N]
	gen pl_totwins_scaled=pl_totwins/(nkeywords^0.5)
	gen pl_totnoms_scaled=pl_totnoms/(nkeywords^0.5)
	gen weightedfreq=1/(nkeywords^0.5)
	gen keywordfreq=1
	collapse (sum) pl_totwins_scaled pl_totnoms_scaled weightedfreq keywordfreq, by(keyword)
	lab var weightedfreq "keyword, down-weighted by how many other kw appear on same films"
	gen keywordratio=pl_totnoms_scaled/keywordfreq /* notation: lamdba_i */
	gen keywordratio_w=pl_totwins_scaled/keywordfreq
	sum
	*histogram keywordratio
	sort keywordratio
	tail

	drop pl_totwins_scaled pl_totnoms_scaled
	save $bigstuff/annual/keyword_scores`i', replace

	merge 1:m keyword using $bigstuff/annual/targetfile_`i'
	recode keywordratio .=0 /* keywords that have newly entered usage */
	recode keywordratio_w .=0 /* keywords that have newly entered usage */
	sum year
	drop _merge
	lab var keywordratio "Oscar bait keywords" /* notation: lambda_it */
	lab var keywordratio_w "Oscar bait keywords (wins)"
	notes keywordratio : sum over keywords in film, where each keyword in film is scored as sum of pseudo-log(keyword appears in nom film) scaled by baseline keyword frequency
	sort imdbtitle
	by imdbtitle: gen nkeywords=[_N]
	drop if imdbtitle=="" /*keywords that have fallen out of use since lag*/
	save $bigstuff/annual/imdb`i'_eligible_oscars_kw, replace

	collapse (sum) keywordratio keywordratio_w, by (imdbtitle year eligible sstitle actor_wins actor_noms bp_oscar bp_nom Fdirector_nom Fdirector_oscar fwriter_oscar Fwriter_nom totalwins totalnoms pl_totwins pl_totnoms nkeywords)
	lab var keywordratio "tau_j - sum of Oscar association for keywords" /* tau_j */
	lab var keywordratio_w "sum of Oscar association for keywords (wins)"
	save $bigstuff/annual/imdb`i'_eligible_oscars, replace
}

clear
forvalues i=`prestart'/`enddate' {
	append using $bigstuff/annual/imdb`i'_eligible_oscars
}
save $bigstuff/imdb_eligible_oscars_lambda_it, replace

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
*do tau for genre
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
use $bigstuff/imdb19752010_eligible_oscars, clear
merge 1:m imdbtitle using $bigstuff/genre_long.dta
keep if _merge==3
drop _merge
save $bigstuff/genres_oscars, replace

sum year
sort imdbtitle
by imdbtitle: gen ngenres=[_N]
gen pl_totwins_scaled=pl_totwins/(ngenres^0.5)
gen pl_totnoms_scaled=pl_totnoms/(ngenres^0.5)
gen weightedfreq=1/(ngenres^0.5)
gen genrefreq=1
collapse (sum) pl_totwins_scaled pl_totnoms_scaled weightedfreq genrefreq, by(genre)
lab var genrefreq "genre, down-weighted by how many other genre appear on same films"
gen genreratio=pl_totnoms_scaled/genrefreq
gen genreratio_w=pl_totwins_scaled/genrefreq
sum
*histogram keywordratio
sort genreratio
tail

drop pl_totwins_scaled pl_totnoms_scaled
save $bigstuff/genre_scores, replace  /*illustrative only, time-varying scores used for calculation*/

*create tau_j with lagged lambda_i
forvalues i=`prestart'/`enddate' {
	use $bigstuff/genres_oscars, clear
	local past=`i'-`window'
	local i1=`i'-1
	keep if year>=`past' & year<`i'
	note: data from `past' - `i1', meant to serve as lagged for `i'
	compress
	save $bigstuff/annual/go_`i', replace
	* "go"="genre + oscars"

	use $bigstuff/genres_oscars, clear
	keep if year==`i'
	compress
	save $bigstuff/annual/targetfile_g_`i', replace

	use $bigstuff/annual/go_`i', clear
	sum year
	sort imdbtitle
	by imdbtitle: gen ngenres=[_N]
	gen pl_totwins_scaled=pl_totwins/(ngenres^0.5)
	gen pl_totnoms_scaled=pl_totnoms/(ngenres^0.5)
	gen weightedfreq=1/(ngenres^0.5)
	gen genrefreq=1
	collapse (sum) pl_totwins_scaled pl_totnoms_scaled weightedfreq genrefreq, by(genre)
	lab var genrefreq "genre, down-weighted by how many other genre appear on same films"
	gen genreratio=pl_totnoms_scaled/genrefreq /* notation: lamdba_i */
	gen genreratio_w=pl_totwins_scaled/genrefreq
	sum
	*histogram keywordratio
	sort genreratio
	tail

	drop pl_totwins_scaled pl_totnoms_scaled
	save $bigstuff/annual/genre_scores`i', replace

	merge 1:m genre using $bigstuff/annual/targetfile_g_`i'
	recode genreratio .=0 /* keywords that have newly entered usage */
	recode genreratio_w .=0 /* keywords that have newly entered usage */
	sum year
	drop _merge
	sort imdbtitle
	by imdbtitle: gen ngenres=[_N]
	drop if imdbtitle=="" /*genres that have fallen out of use since lag*/
	save $bigstuff/annual/imdb`i'_eligible_oscars_g, replace

	collapse (sum) genreratio genreratio_w, by (imdbtitle year eligible sstitle actor_wins actor_noms bp_oscar bp_nom Fdirector_nom Fdirector_oscar fwriter_oscar Fwriter_nom totalwins totalnoms pl_totwins pl_totnoms ngenres)
	lab var genreratio "sum of Oscar association for keywords" /*notation: tau_j */
	save $bigstuff/annual/imdb`i'_eligible_oscars_g, replace
}

clear
forvalues i=`prestart'/`enddate' {
	append using $bigstuff/annual/imdb`i'_eligible_oscars_g
}
save $bigstuff/imdb_eligible_oscars_lambda_g_it, replace

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
*stack the enchilada
*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
use eligible_19752010.dta, clear
*attach lambda -- keyword
merge 1:1 imdbtitle using $bigstuff/imdb_eligible_oscars_lambda_it
drop _merge
*attach lambda -- genre
merge 1:1 imdbtitle using $bigstuff/imdb_eligible_oscars_lambda_g_it
drop _merge
*attach genre
merge 1:1 imdbtitle using $bigstuff/genre.dta
drop _merge
compress
desc
*attach past noms variables
merge 1:1 imdbtitle using $bigstuff/pastnoms
drop _merge
*attach distributor (major coding only, not string for dist)
merge 1:1 imdbtitle using $bigstuff/distributors
drop _merge
*attach mpaa
merge 1:1 imdbtitle using $bigstuff/mpaa
drop _merge
*attach film noms
merge 1:1 imdbtitle using $bigstuff/imdb19752010_eligible_oscars, update
drop _merge
*attach release dates
merge 1:1 imdbtitle using $bigstuff/releasedates, update
drop _merge
*clean up variables
recode mpaa .=5
recode major .=0
recode indymajor .=0
recode pl_totnoms .=0
recode pl_totwins .=0
drop if imdbtitle==""
gen mpaa_r=mpaa
recode mpaa_r 0/3=0 4=1 5/100=0
save $bigstuff/enchilada, replace

desc
sum
tab year

*have a nice day

The second do-file takes the output of the first and uses it to calculate Oscar appeal as the predicted values of a regression based on lagged data.

*predictedvalues.do
*Stata do-file for Rossman & Schilke "Close But No Cigar"
*Please direct any questions to rossman@soc.ucla.edu

*This do-file must be run after mergeorama.do
*It calculate the "Oscar appeal" variable at the center of the paper as the predicted values from a regression over rolling windows with lagged betas
* (ie, the data point being predicted is not included in the estimation of the betas used to calculate it)

*Our file structure is unfortunately somewhat complicated but we always refer to paths through globals. There are no hard-coded paths so changing these globals to fit your own file system should solve everything
global parentpath "~/Dropbox/oscarsperformativity/studiosystem"
global oscarsfolder "~/Documents/oscars"
global bigstuff "$oscarsfolder/studiosystemoverflow"
global recent_dir "$oscarsfolder/dataapril2010/maria"
global recent_eligibles "$recent_dir/filmtitles_Apr2010.txt"
global recent_oscars "$recent_dir/oscars20062010.txt"
global tables "$parentpath/tables"

capture log close
log using $parentpath/predictedvalues.log, replace

*These switches turn batches of code on and off. This both facilitates code-folding and debugging. For final run though make sure everything is turned on
local window 5
local startdate 1985
local enddate 2009

local mpaalist "mpaa_r"
local pastnoms "actornomsum_dummy writernomsum_dummy directornomsum_dummy"
local tauspec "gr_* kr_*"
local majorspec "major indymajor"
local datespec "dy_1 dy_2 dy_3"
local xvector "`majorspec' `tauspec' `pastnoms' `mpaalist' `datespec' `fyc' `rtscore'"
disp "`xvector'"

capture program drop tauspline
program define tauspline
	quietly sum genreratio, detail
	mkspline gr_lowmed `r(p50)' gr_high=genreratio
	quietly sum keywordratio, detail
	mkspline kr_lowmed `r(p50)' kr_high=keywordratio
end

use $bigstuff/enchilada, clear /* load the data cleaned by mergeorama.do */
local gooddatavarlist "major indymajor keywordratio genreratio actornomsum_dummy writernomsum_dummy directornomsum_dummy mpaa_r dy_1 dy_2 dy_3"
sum `gooddatavarlist'
foreach var in `gooddatavarlist' {
	disp "`var'"
	drop if `var'==.
	disp "-----"
}
keep imdbtitle
save $bigstuff/gooddatalist.dta, replace

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
*estimation with the full sample, across all years
*this estimation is not used for calculations of Oscar Appeal, but just for parsimonious illustration in Table 1
use $bigstuff/enchilada, clear
sum `gooddatavarlist'
merge m:1 imdbtitle using $bigstuff/gooddatalist
keep if _merge==3
drop _merge
tauspline
sum `xvector'
tab year

*table 1
nbreg totalnoms `xvector', iterate(500)
esttab using $tables/nbregtotalnoms_pooled.txt, se b(3) se(3) scalars(ll) nodepvars nomtitles label title(NEGATIVE BINOMIAL REGRESSION OF TOTAL NOMINATIONS) replace fixed

*=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
*estimations based on rolling time windows
*these are used to calculate Oscar appeal
eststo clear
forvalues i=`startdate'/`enddate' { /*loop over years*/
	disp "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-" _n "`i'" _n "=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-"
	use $bigstuff/enchilada, clear
	local past=`i'-`window'
	keep if year>=`past' & year<=`i' /*keep only the focal year and it's recent past*/
	gen totnoms=totalnoms /*make a backup of Y before recoding to missing for "predict" purposes*/
	replace totalnoms=. if year==`i' /*recode the focal year to missing (so as to not fit on noise for y-hat betas)*/
	tauspline
	disp _n _n "nbreg total number of nominations ..."
	nbreg totalnoms  `xvector', iterate(500)
	eststo
	predict totalnoms_hat_xb, xb /*calculate Oscar Appeal*/
	predict totalnoms_hat
	estat ic
	keep if year==`i' /*keep only the focal year (ie, drop the predictions for it's recent past)*/
	save $bigstuff/annual/predictions`i', replace
}

*table of time-varying betas, summarized in Appendix C
esttab using $tables/oscarbait.txt, se b(3) se(3) scalars(ll) nodepvars  label title(NEGATIVE BINOMIAL REGRESSION OF OSCAR NOMS BY 5 YEAR WINDOWS) replace fixed nomtitles

*stitch together the annual Oscar appeal files, combine w covariates
clear
forvalues i=`startdate'/`enddate' {
	append using $bigstuff/annual/predictions`i'
}
drop bp_oscar bp_nom totalwins totalnoms _est_est*
save $parentpath/predictions, replace

use $bigstuff/enchilada, clear
keep bp_oscar bp_nom imdbtitle
merge 1:1 imdbtitle using $parentpath/predictions
drop _merge
lab var totalnoms_hat_xb "Oscar Appeal"
save $parentpath/predictions, replace

sum

*have a nice day
About these ads

Entry filed under: Uncategorized. Tags: , , .

Gifts Make Slaves, Whips Make Dogs, and Covenant With Yahweh Makes Charity Bargaining With Assholes

5 Comments

  • 1. Jim  |  August 1, 2013 at 1:17 pm

    Congratulations on a fine paper. I was hoping for this outcome. Will look forward to reading the final version.

  • 3. kavya  |  September 4, 2013 at 11:34 am

    can u plz send me complete perl code for my project :
    Comparison of term frequency and document frequency based feature selection methods in text categorization

  • 5. kavya  |  September 4, 2013 at 11:43 am

    sir plzzz help me . i have to sub met my project and save my life sir.

    plz send me complete perl code for my project :
    Comparison of term frequency and document frequency based feature selection methods in text categorization


The Culture Geeks

Recent Posts


Follow

Get every new post delivered to your Inbox.

Join 1,479 other followers

%d bloggers like this: