First, we load the House of Commons corpus from the ParlSpeech dataset.

Corp_HouseOfCommons_V2 <- readRDS("../data/Corp_HouseOfCommons_V2.rds")

We keep only speeches where the speaker is not the chair and is a member of one of the two largest parties (Conservatives or Labour).

We also convert the party variable into a factor and convert the date into a proper date format (creating a year variable along the way), since we will need both later.

corp_hoc <- Corp_HouseOfCommons_V2 %>%
  filter(chair == FALSE) %>%               # drop speeches by the chair
  filter(party %in% c("Con", "Lab")) %>%   # keep only Conservative and Labour speakers
  mutate(date = as.Date(date)) %>% 
  mutate(year = lubridate::year(date)) %>% # year variable for sampling and subsetting later
  mutate(party = as.factor(party))         # party as a factor (the outcome for the classifier)

corpus_hoc <- corpus(corp_hoc)

Overview: How much data do we have over time?

corp_hoc %>% 
  ggplot() +
  geom_bar(aes(x = date)) +
  facet_wrap(~party)
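
If you prefer numbers to a plot, you could also tabulate the speeches per year and party; a minimal sketch using dplyr:

# number of speeches per year and party (first few rows only)
corp_hoc %>%
  dplyr::count(year, party) %>%
  head()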

To make the task more manageable on a normal computer, we draw a sample of 4,000 speeches per year. You may want to download this sample directly: sample_corp.RData

sample_corp <- corpus_hoc %>% 
  corpus_sample(4000,by=year)
save(sample_corp,file="../data/sample_corp.RData")
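
Note that corpus_sample() draws at random, so your sample will differ from ours (and from the downloadable file) unless you fix the random seed beforehand. A minimal sketch, with an arbitrary seed value:

set.seed(1234)  # arbitrary seed; any fixed value makes the sample reproducible
sample_corp <- corpus_hoc %>% 
  corpus_sample(4000, by = year)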

If you downloaded the sample corpus, load it and start here. Now, we extract the data for one year from our sample corpus.

load("../data/sample_corp.RData")
corpus_hoc_2019 <- corpus_subset(sample_corp,year==2019)
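
A quick check that the subset has the expected size (4,000 speeches, i.e. one year's worth of the sample):

ndoc(corpus_hoc_2019)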

Now, we’re ready to create a document feature matrix. You may choose to remove punctuation or do similar pre-processing.

Then, split the data into a test and a training set.

dfm_hoc <- corpus_hoc_2019 %>% tokens(remove_punct = TRUE) %>% dfm()
hoc_train <- dfm_sample(dfm_hoc,0.8*ndoc(dfm_hoc))
hoc_test <- dfm_subset(dfm_hoc,!(docnames(dfm_hoc) %in% docnames(hoc_train)))
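
Since dfm_sample() draws documents at random, it can be worth checking that both parties are well represented in the resulting training set; a quick check:

# number of training documents per party
table(docvars(hoc_train, "party"))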

We will return to this later, but for now we may choose to remove features that are very rare or very frequent in order to reduce the number of features.

hoc_train <- dfm_trim(hoc_train,
                        min_docfreq=0.01,
                        max_docfreq=0.8,
                        docfreq_type="prop",
                        verbose=T)
## Removing features occurring:
##   - fewer than 1 time: 2,675
##   - in fewer than 32 documents: 23,496
##   - in more than 2560 documents: 5
##   Total features removed: 23,501 (94.4%).
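
To see how many documents and features are left after trimming, you can inspect the dimensions of the trimmed dfm:

# rows = training documents, columns = remaining features
dim(hoc_train)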

Next, we adjust the features of the test dfm to those of the training dfm.

hoc_test <- dfm_match(hoc_test,featnames(hoc_train))
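
After matching, the training and test dfms should share exactly the same feature set; a quick sanity check:

# TRUE if the feature sets (and their order) are identical
identical(featnames(hoc_train), featnames(hoc_test))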

It’s time to train the model, predict on the test set, and see how well we did.

nb_model <- textmodel_nb(hoc_train, docvars(hoc_train, "party"))
nb_model
## 
## Call:
## textmodel_nb.dfm(x = hoc_train, y = docvars(hoc_train, "party"))
## 
##  Distribution: multinomial ; priors: 0.5 0.5 ; smoothing value: 1 ; 3200 training documents;  fitted features.
preds <- predict(nb_model,hoc_test)
table(preds,docvars(hoc_test,"party"))
##      
## preds Con Lab
##   Con 412  94
##   Lab 115 179
# confusion matrix:
caret::confusionMatrix(preds,docvars(hoc_test,"party"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Con Lab
##        Con 412  94
##        Lab 115 179
##                                           
##                Accuracy : 0.7388          
##                  95% CI : (0.7068, 0.7689)
##     No Information Rate : 0.6587          
##     P-Value [Acc > NIR] : 6.621e-07       
##                                           
##                   Kappa : 0.4295          
##                                           
##  Mcnemar's Test P-Value : 0.1665          
##                                           
##             Sensitivity : 0.7818          
##             Specificity : 0.6557          
##          Pos Pred Value : 0.8142          
##          Neg Pred Value : 0.6088          
##              Prevalence : 0.6587          
##          Detection Rate : 0.5150          
##    Detection Prevalence : 0.6325          
##       Balanced Accuracy : 0.7187          
##                                           
##        'Positive' Class : Con             
## 
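
As a sanity check, you can also compute the accuracy by hand from the confusion table: the number of correctly classified documents divided by the total number of test documents. This should reproduce caret's Accuracy value:

# accuracy by hand: correct predictions / all test documents
conf_tab <- table(preds, docvars(hoc_test, "party"))
sum(diag(conf_tab)) / sum(conf_tab)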

The effect of pre-processing

Repeat what we have done above, but use different trimming thresholds. Look at both the accuracy and the speed to get a feel for the usefulness of pre-processing and the impact it may have on your predictions.

# recreate the untrimmed training and test dfms so we can also try other trimming thresholds
hoc_train <- dfm_subset(dfm_hoc, docnames(dfm_hoc) %in% docnames(hoc_train)) 
hoc_test <- dfm_subset(dfm_hoc, !(docnames(dfm_hoc) %in% docnames(hoc_train)))

start_05 <- Sys.time()
hoc_train_05 <- dfm_trim(hoc_train,
                        min_docfreq=0.05,
                        max_docfreq=0.8,
                        docfreq_type="prop",
                        verbose=T)
## Removing features occurring:
##   - fewer than 1 time: 2,675
##   - in fewer than 160 documents: 24,597
##   - in more than 2560 documents: 5
##   Total features removed: 24,602 (98.8%).
hoc_test_05 <- dfm_match(hoc_test,featnames(hoc_train_05))

nb_model_05 <- textmodel_nb(hoc_train_05, docvars(hoc_train_05, "party"))
preds_05 <- predict(nb_model_05,hoc_test_05)
end_05 <- Sys.time()
end_05-start_05
## Time difference of 0.03900909 secs
caret::confusionMatrix(preds_05,docvars(hoc_test,"party"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Con Lab
##        Con 386  94
##        Lab 141 179
##                                           
##                Accuracy : 0.7062          
##                  95% CI : (0.6733, 0.7376)
##     No Information Rate : 0.6587          
##     P-Value [Acc > NIR] : 0.002355        
##                                           
##                   Kappa : 0.3727          
##                                           
##  Mcnemar's Test P-Value : 0.002694        
##                                           
##             Sensitivity : 0.7324          
##             Specificity : 0.6557          
##          Pos Pred Value : 0.8042          
##          Neg Pred Value : 0.5594          
##              Prevalence : 0.6587          
##          Detection Rate : 0.4825          
##    Detection Prevalence : 0.6000          
##       Balanced Accuracy : 0.6941          
##                                           
##        'Positive' Class : Con             
## 
start_50 <- Sys.time()
hoc_train_50 <- dfm_trim(hoc_train,
                        min_docfreq=0.50,
                        max_docfreq=0.8,
                        docfreq_type="prop",
                        verbose=T)
## Removing features occurring:
##   - fewer than 1 time: 2,675
##   - in fewer than 1600 documents: 24,877
##   - in more than 2560 documents: 5
##   Total features removed: 24,882 (100.0%).
hoc_test_50 <- dfm_match(hoc_test,featnames(hoc_train_50))

nb_model_50 <- textmodel_nb(hoc_train_50, docvars(hoc_train_50, "party"))
preds_50 <- predict(nb_model_50,hoc_test_50)
end_50 <- Sys.time()
end_50-start_50
## Time difference of 0.03200817 secs
caret::confusionMatrix(preds_50,docvars(hoc_test,"party"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Con Lab
##        Con 353 110
##        Lab 174 163
##                                           
##                Accuracy : 0.645           
##                  95% CI : (0.6107, 0.6782)
##     No Information Rate : 0.6587          
##     P-Value [Acc > NIR] : 0.8047130       
##                                           
##                   Kappa : 0.2526          
##                                           
##  Mcnemar's Test P-Value : 0.0001852       
##                                           
##             Sensitivity : 0.6698          
##             Specificity : 0.5971          
##          Pos Pred Value : 0.7624          
##          Neg Pred Value : 0.4837          
##              Prevalence : 0.6587          
##          Detection Rate : 0.4412          
##    Detection Prevalence : 0.5787          
##       Balanced Accuracy : 0.6334          
##                                           
##        'Positive' Class : Con             
## 
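
To compare the two runs side by side, you could collect accuracy and runtime in a small data frame. A minimal sketch, assuming the objects from the runs above are still in your workspace:

# accuracy and elapsed time for the two trimming thresholds
data.frame(
  min_docfreq = c(0.05, 0.50),
  accuracy = c(mean(as.character(preds_05) == as.character(docvars(hoc_test, "party"))),
               mean(as.character(preds_50) == as.character(docvars(hoc_test, "party")))),
  seconds  = c(as.numeric(end_05 - start_05, units = "secs"),
               as.numeric(end_50 - start_50, units = "secs"))
)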

Implementing this at scale (advanced)

To carry out a substantive evaluation of accuracy over time, as Peterson and Spirling do, we will have to repeat this procedure several times. For that, it is easiest to write a function and use a loop or an apply command.

Based on your previous code, write a function that calculates the accuracy for each year.

Your function should definitely have an argument for the year, but you may also want to be able to vary other things, e.g. the trimming thresholds, the size of the training and test samples, ….

Tip: You can either calculate the accuracy yourself or, if you store the caret::confusionMatrix() output into a variable confmat, you can use confmat$overall[['Accuracy']] to directly get the number as an output.
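
For example, with the predictions from the first model above, this looks as follows (the function below uses the same trick in its final line):

# store the confusion matrix and extract the accuracy directly
confmat <- caret::confusionMatrix(preds, docvars(hoc_test, "party"))
confmat$overall[['Accuracy']]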

calc_yearly <- function(fyear, sample, trim_low, trim_high, trainsize = 0.8){
  # build a dfm for one year, removing punctuation at the tokenisation stage
  fdfm <- corpus_subset(sample, year == fyear) %>%
    tokens(remove_punct = TRUE) %>%
    dfm()
  # split into a training set (trainsize share of documents) and a test set
  fdfm_train <- dfm_sample(fdfm, trainsize * ndoc(fdfm))
  fdfm_test <- dfm_subset(fdfm, !docnames(fdfm) %in% docnames(fdfm_train))
  # trim very rare and very frequent features from the training dfm
  fdfm_train <- dfm_trim(fdfm_train,
                         min_docfreq = trim_low,
                         max_docfreq = trim_high,
                         docfreq_type = "prop",
                         verbose = FALSE)
  # restrict the test dfm to the trimmed training features
  fdfm_test <- dfm_match(fdfm_test, featnames(fdfm_train))
  # train, predict, and return the accuracy
  nb_model <- textmodel_nb(fdfm_train, docvars(fdfm_train, "party"))
  preds <- predict(nb_model, fdfm_test)
  caret::confusionMatrix(preds, docvars(fdfm_test, "party"))$overall[['Accuracy']]
}
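
Before looping over all years, it can be worth testing the function on a single year to check that it returns a single accuracy value (the exact number will vary with the random train/test split):

calc_yearly(2019, sample_corp, 0.01, 0.8)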

Now, run this function for each year.

yearly_accuracy <- sapply(1988:2019,calc_yearly,sample_corp,0.01,0.8)
names(yearly_accuracy) <- 1988:2019

yearly_accuracy
##    1988    1989    1990    1991    1992    1993    1994    1995    1996    1997 
## 0.79125 0.76625 0.77625 0.75625 0.76750 0.74000 0.77500 0.74375 0.77625 0.63500 
##    1998    1999    2000    2001    2002    2003    2004    2005    2006    2007 
## 0.67000 0.72875 0.72375 0.68250 0.69875 0.68875 0.69875 0.71375 0.69125 0.72125 
##    2008    2009    2010    2011    2012    2013    2014    2015    2016    2017 
## 0.72125 0.72375 0.57125 0.69500 0.70750 0.71500 0.70875 0.69375 0.72500 0.75375 
##    2018    2019 
## 0.73375 0.74000
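
Finally, plotting these accuracies makes the over-time pattern easier to see; this is essentially the quantity that analyses like Peterson and Spirling's focus on. A minimal sketch with ggplot2, assuming the tidyverse is loaded as above:

# yearly classification accuracy over time
data.frame(year = as.integer(names(yearly_accuracy)),
           accuracy = yearly_accuracy) %>%
  ggplot(aes(x = year, y = accuracy)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Classification accuracy")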