First, we load the House of Commons corpus from the ParlSpeech dataset:
Corp_HouseOfCommons_V2 <- readRDS("../data/Corp_HouseOfCommons_V2.rds")
We keep only speeches where the speaker is not the chair and belongs to one of the two largest parties (Conservative or Labour).
We also convert the party variable to a factor and parse the date (creating a year variable along the way), since we will need both later.
corp_hoc <- Corp_HouseOfCommons_V2 %>%
filter(chair==F) %>%
filter(party %in% c("Con","Lab")) %>%
mutate(date=as.Date(date)) %>%
mutate(year=lubridate::year(date)) %>%
mutate(party=as.factor(party))
corpus_hoc <- corpus(corp_hoc)
Overview: How much data do we have over time?
corp_hoc %>%
ggplot()+
geom_bar(aes(x=date))+   # geom_bar() counts speeches per sitting day
facet_wrap(~party)
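If you prefer numbers to a plot, a quick tabulation works as well (a sketch using dplyr's count(), which should already be loaded):
# number of speeches per party and year
corp_hoc %>%
count(party,year)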
To make the task more manageable on a normal computer, we draw a sample of 4,000 speeches per year. You may want to download this sample directly: sample_corp.RData
sample_corp <- corpus_hoc %>%
corpus_sample(4000,by=year)
save(sample_corp,file="../data/sample_corp.RData")
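Note that corpus_sample() draws at random, so the sample will differ between runs. If you want a reproducible sample, set a seed before sampling (the seed value below is arbitrary):
# fix the random number generator so the sample is reproducible
set.seed(2023)
sample_corp <- corpus_hoc %>%
corpus_sample(4000,by=year)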
If you downloaded the sample corpus, load it and start here. Now, we extract the data for a single year from our sample corpus.
load("../data/sample_corp.RData")
corpus_hoc_2019 <- corpus_subset(sample_corp,year==2019)
Now, we’re ready to create a document-feature matrix (dfm). You may choose to remove punctuation or do similar pre-processing; in current quanteda versions these options belong in tokens() rather than dfm().
Then, split the data into a test and a training set.
dfm_hoc <- corpus_hoc_2019 %>% tokens(remove_punct=TRUE) %>% dfm()
hoc_train <- dfm_sample(dfm_hoc,0.8*ndoc(dfm_hoc))
hoc_test <- dfm_subset(dfm_hoc,!(docnames(dfm_hoc) %in% docnames(hoc_train)))
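As a quick sanity check, both parties should be reasonably represented in the random training split:
# class balance in the training set
table(docvars(hoc_train,"party"))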
We will talk about this again later, but we may choose to remove features that are very rare or very frequent in order to reduce the number of features.
hoc_train <- dfm_trim(hoc_train,
min_docfreq=0.01,
max_docfreq=0.8,
docfreq_type="prop",
verbose=T)
## Removing features occurring:
## - fewer than 1 time: 2,675
## - in fewer than 32 documents: 23,496
## - in more than 2560 documents: 5
## Total features removed: 23,501 (94.4%).
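To see how aggressive this is, compare the vocabulary size of the full dfm with the trimmed training dfm:
# vocabulary size before and after trimming
nfeat(dfm_hoc)
nfeat(hoc_train)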
Next, adjust the features of the test dfm to match the training dfm:
hoc_test <- dfm_match(hoc_test,featnames(hoc_train))
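After dfm_match(), the training and test dfm share an identical feature set, which you can verify directly:
# should return TRUE: same features in the same order
identical(featnames(hoc_train),featnames(hoc_test))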
It’s time to train the model, generate predictions, and see how well we did.
nb_model <- textmodel_nb(hoc_train, docvars(hoc_train, "party"))
nb_model
##
## Call:
## textmodel_nb.dfm(x = hoc_train, y = docvars(hoc_train, "party"))
##
## Distribution: multinomial ; priors: 0.5 0.5 ; smoothing value: 1 ; 3200 training documents; fitted features.
preds <- predict(nb_model,hoc_test)
table(preds,docvars(hoc_test,"party"))
##
## preds Con Lab
## Con 412 94
## Lab 115 179
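From this cross-tab, accuracy is simply the share of predictions on the diagonal; you can also compute it directly:
# share of predictions that match the true party
mean(preds==docvars(hoc_test,"party"))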
# confusion matrix:
caret::confusionMatrix(preds,docvars(hoc_test,"party"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Con Lab
## Con 412 94
## Lab 115 179
##
## Accuracy : 0.7388
## 95% CI : (0.7068, 0.7689)
## No Information Rate : 0.6587
## P-Value [Acc > NIR] : 6.621e-07
##
## Kappa : 0.4295
##
## Mcnemar's Test P-Value : 0.1665
##
## Sensitivity : 0.7818
## Specificity : 0.6557
## Pos Pred Value : 0.8142
## Neg Pred Value : 0.6088
## Prevalence : 0.6587
## Detection Rate : 0.5150
## Detection Prevalence : 0.6325
## Balanced Accuracy : 0.7187
##
## 'Positive' Class : Con
##
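It can also be instructive to inspect which words drive the classification. A sketch, assuming the fitted object stores the class-conditional feature probabilities in $param (as current versions of quanteda.textmodels do):
# log-ratio of P(word|Con) to P(word|Lab); large values lean Conservative
word_ratio <- log(nb_model$param["Con",]/nb_model$param["Lab",])
head(sort(word_ratio,decreasing=TRUE),10) # most Conservative-leaning words
head(sort(word_ratio),10)                 # most Labour-leaning words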
Repeat what we have done above, but use different trimming thresholds. Look at both the accuracy and the speed to get a feeling for the usefulness of pre-processing and the impact it may have on your predictions.
# recreate the untrimmed train/test split so we can also choose lower trimming thresholds
hoc_train <- dfm_subset(dfm_hoc,(docnames(dfm_hoc) %in% docnames(hoc_train)))
hoc_test <- dfm_subset(dfm_hoc,!(docnames(dfm_hoc) %in% docnames(hoc_train)))
start_05 <- Sys.time()
hoc_train_05 <- dfm_trim(hoc_train,
min_docfreq=0.05,
max_docfreq=0.8,
docfreq_type="prop",
verbose=T)
## Removing features occurring:
## - fewer than 1 time: 2,675
## - in fewer than 160 documents: 24,597
## - in more than 2560 documents: 5
## Total features removed: 24,602 (98.8%).
hoc_test_05 <- dfm_match(hoc_test,featnames(hoc_train_05))
nb_model_05 <- textmodel_nb(hoc_train_05, docvars(hoc_train_05, "party"))
preds_05 <- predict(nb_model_05,hoc_test_05)
end_05 <- Sys.time()
end_05-start_05
## Time difference of 0.03900909 secs
caret::confusionMatrix(preds_05,docvars(hoc_test,"party"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Con Lab
## Con 386 94
## Lab 141 179
##
## Accuracy : 0.7062
## 95% CI : (0.6733, 0.7376)
## No Information Rate : 0.6587
## P-Value [Acc > NIR] : 0.002355
##
## Kappa : 0.3727
##
## Mcnemar's Test P-Value : 0.002694
##
## Sensitivity : 0.7324
## Specificity : 0.6557
## Pos Pred Value : 0.8042
## Neg Pred Value : 0.5594
## Prevalence : 0.6587
## Detection Rate : 0.4825
## Detection Prevalence : 0.6000
## Balanced Accuracy : 0.6941
##
## 'Positive' Class : Con
##
start_50 <- Sys.time()
hoc_train_50 <- dfm_trim(hoc_train,
min_docfreq=0.50,
max_docfreq=0.8,
docfreq_type="prop",
verbose=T)
## Removing features occurring:
## - fewer than 1 time: 2,675
## - in fewer than 1600 documents: 24,877
## - in more than 2560 documents: 5
## Total features removed: 24,882 (100.0%).
hoc_test_50 <- dfm_match(hoc_test,featnames(hoc_train_50))
nb_model_50 <- textmodel_nb(hoc_train_50, docvars(hoc_train_50, "party"))
preds_50 <- predict(nb_model_50,hoc_test_50)
end_50 <- Sys.time()
end_50-start_50
## Time difference of 0.03200817 secs
caret::confusionMatrix(preds_50,docvars(hoc_test,"party"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction Con Lab
## Con 353 110
## Lab 174 163
##
## Accuracy : 0.645
## 95% CI : (0.6107, 0.6782)
## No Information Rate : 0.6587
## P-Value [Acc > NIR] : 0.8047130
##
## Kappa : 0.2526
##
## Mcnemar's Test P-Value : 0.0001852
##
## Sensitivity : 0.6698
## Specificity : 0.5971
## Pos Pred Value : 0.7624
## Neg Pred Value : 0.4837
## Prevalence : 0.6587
## Detection Rate : 0.4412
## Detection Prevalence : 0.5787
## Balanced Accuracy : 0.6334
##
## 'Positive' Class : Con
##
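Assuming the objects from all three runs are still in memory, you can line the accuracies up for a direct comparison:
# accuracy under the three trimming thresholds, side by side
c("0.01"=mean(preds==docvars(hoc_test,"party")),
"0.05"=mean(preds_05==docvars(hoc_test,"party")),
"0.50"=mean(preds_50==docvars(hoc_test,"party")))
The pattern is typical: more aggressive trimming makes everything faster but discards informative features, so accuracy drops from about 0.74 to 0.71 to 0.65.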
To really implement a substantive evaluation of accuracy metrics like Peterson and Spirling, we have to repeat this procedure several times. For that, it is easiest to write a function and use a loop or an apply command.
Based on your previous code, write a function that calculates the accuracy for each year.
Your function should definitely have an argument for the year, but you may want to be able to vary other things as well, e.g. the trimming thresholds or the size of the training and test samples.
Tip: You can either calculate the accuracy yourself or, if you store the caret::confusionMatrix() output in a variable confmat, use confmat$overall[['Accuracy']] to get the number directly.
calc_yearly <- function(fyear, sample, trim_low, trim_high, trainsize = 0.8){
  # build the dfm for one year; remove_punct belongs in tokens()
  fdfm <- corpus_subset(sample, year == fyear) %>%
    tokens(remove_punct = TRUE) %>%
    dfm()
  # random split: trainsize is the share of documents used for training
  fdfm_train <- dfm_sample(fdfm, round(trainsize * ndoc(fdfm)))
  fdfm_test <- dfm_subset(fdfm, !docnames(fdfm) %in% docnames(fdfm_train))
  # trim very rare and very frequent features (thresholds as document proportions)
  fdfm_train <- dfm_trim(fdfm_train,
                         min_docfreq = trim_low,
                         max_docfreq = trim_high,
                         docfreq_type = "prop",
                         verbose = FALSE)
  # align the test features with the trimmed training features
  fdfm_test <- dfm_match(fdfm_test, featnames(fdfm_train))
  # fit Naive Bayes, predict, and return the accuracy
  nb_model <- textmodel_nb(fdfm_train, docvars(fdfm_train, "party"))
  preds <- predict(nb_model, fdfm_test)
  caret::confusionMatrix(preds, docvars(fdfm_test, "party"))$overall[['Accuracy']]
}
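Before looping over all years, it is worth checking that the function works for a single year:
# quick check on one year
calc_yearly(2019,sample_corp,trim_low=0.01,trim_high=0.8)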
Now, run this function for each year.
yearly_accuracy <- sapply(1988:2019,calc_yearly,sample_corp,0.01,0.8)
names(yearly_accuracy) <- 1988:2019
yearly_accuracy
## 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
## 0.79125 0.76625 0.77625 0.75625 0.76750 0.74000 0.77500 0.74375 0.77625 0.63500
## 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007
## 0.67000 0.72875 0.72375 0.68250 0.69875 0.68875 0.69875 0.71375 0.69125 0.72125
## 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
## 0.72125 0.72375 0.57125 0.69500 0.70750 0.71500 0.70875 0.69375 0.72500 0.75375
## 2018 2019
## 0.73375 0.74000
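To see the over-time trend that Peterson and Spirling highlight, you can plot these accuracies by year (a minimal sketch using the ggplot2 setup from above):
# plot classification accuracy over time
data.frame(year=as.integer(names(yearly_accuracy)),
accuracy=yearly_accuracy) %>%
ggplot(aes(x=year,y=accuracy))+
geom_line()+
geom_point()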