library(tidyverse)
library(quanteda)
library(quanteda.textmodels)

Preparations

We now take a look at the EUI thesis abstracts to evaluate what differentiates the departments.

Load the data and create a corpus.

You can filter out all texts below a certain number of characters using str_length(), or check whether the abstract variable is NA. The data is a bit messy, so I recommend doing this to get rid of abstracts that were scraped incompletely or are not available in Cadmus.

load("../data/theses.RData")
eui_corp <- theses %>%
  dplyr::filter(str_length(abstract)>300) %>% 
  corpus(text_field="abstract")
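
As an aside, the NA check mentioned above is already covered implicitly: str_length(NA) returns NA, and filter() drops rows where the condition is NA. If you prefer to make it explicit, a minimal sketch (same result) would be:

eui_corp <- theses %>%
  # drop missing abstracts explicitly before checking their length
  dplyr::filter(!is.na(abstract), str_length(abstract) > 300) %>%
  corpus(text_field = "abstract")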

Now, create a dfm and do whatever pre-processing you think might be useful.

eui_dfm <- tokens(eui_corp,
                  remove_punct=T,
                  remove_symbols=T,
                  split_hyphens=T) %>%
  dfm()

# Note that you could of course do more extensive preprocessing.
# For example, as you will see later on, there are non-English theses in the sample,
# so you might try to find and delete these documents using frequent French, German and Italian words.
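
A rough sketch of how you might act on this comment, flagging documents by their share of common French, German and Italian function words. The stopword lists, the removal of words that overlap with English, the variable names and the 10% threshold are my own arbitrary choices, not part of the original solution:

# words that are frequent in French, German or Italian but not in English
foreign_words <- setdiff(c(stopwords("fr"), stopwords("de"), stopwords("it")),
                         stopwords("en"))

# share of tokens per document that match this list
toks <- tokens(eui_corp, remove_punct = TRUE)
foreign_share <- dfm(toks) %>%
  dfm_select(pattern = foreign_words) %>%
  rowSums() / ntoken(toks)

# keep only documents with few matches (threshold picked by eyeballing)
eui_corp_en <- corpus_subset(eui_corp, foreign_share < 0.1)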

Training the classifier

Using the department document variable, train a Naive Bayes and a SVM model to predict the department.

You can skip the train-test split here, since we do not need to assess the accuracy of our model for this exercise.

model_nb <- textmodel_nb(eui_dfm,docvars(eui_dfm,"department"))
model_svm <- textmodel_svm(eui_dfm,docvars(eui_dfm,"department"))
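
If you do want to check accuracy, a minimal sketch of a train/test split for the Naive Bayes model could look like the following; the 80/20 split, the seed and the variable names are my own choices:

set.seed(123)
train_ids <- sample(seq_len(ndoc(eui_dfm)), size = floor(0.8 * ndoc(eui_dfm)))

dfm_train <- eui_dfm[train_ids, ]
dfm_test  <- eui_dfm[-train_ids, ]

model_nb_train <- textmodel_nb(dfm_train, docvars(dfm_train, "department"))

# align the test features with the training dfm before predicting
pred <- predict(model_nb_train,
                newdata = dfm_match(dfm_test, featnames(dfm_train)))
mean(pred == docvars(dfm_test, "department"))  # share of correctly classified theses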

Feature evaluation

You can use coef() on the model to get a matrix of coefficients per feature and ‘class’.

One thing to watch out for is that the two matrices are oriented differently: for the SVM coefficients, the departments are the first dimension (rows); for the Naive Bayes model, they are the second dimension (columns).

This is a bit tricky, so I include the code. I recommend

  • transposing the matrix of SVM coefficients using t() (that is, turning columns into rows and vice versa)
  • storing both matrices as data frames
  • storing the row names in a new variable called feature

coefs_svm <- coef(model_svm) %>%
  t() %>% 
  data.frame()%>%
  mutate(feature=rownames(.))

coefs_nb <- coef(model_nb) %>% 
  data.frame() %>%
  mutate(feature=rownames(.))
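
A quick optional check that both data frames now have the same shape - features as rows, one column per department plus the feature column:

dim(coefs_svm)
dim(coefs_nb)
names(coefs_nb)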

Now, try to find the most predictive features for two of the departments. You can do this, for example, by using arrange() on the data frame, but there are other solutions; this is just a data management task.

You should see a difference between the results of the two classifiers, which follows from how they build their predictions from the feature scores: the Naive Bayes coefficients are essentially class-conditional word probabilities, so very frequent words such as "the" top the list for every department, while the SVM weights are discriminative, so distinctive terms rise to the top.

coefs_nb %>% 
  arrange(desc(eco_theses)) %>% 
  head() %>% 
  # no need for this line - it just makes the results nicer to read in rmarkdown
  knitr::kable(caption="NB Economy")
NB Economy

feature    eco_theses   hec_theses   law_theses   sps_theses
the         0.0563003    0.0645680    0.0695189    0.0651366
of          0.0318497    0.0393360    0.0441801    0.0412069
in          0.0227299    0.0209059    0.0187357    0.0234380
and         0.0208371    0.0291695    0.0262201    0.0313294
to          0.0183789    0.0161914    0.0197473    0.0207963
a           0.0168630    0.0141238    0.0146522    0.0141004

coefs_svm %>% 
  arrange(desc(eco_theses)) %>% 
  head() %>% 
  knitr::kable(caption="SVM Economy")
SVM Economy

feature      sps_theses   hec_theses   law_theses   eco_theses
chapter      -0.1041301   -0.0413539   -0.0014945    0.1187724
models       -0.1054706   -0.0716182   -0.0463843    0.1074100
economy      -0.0514300   -0.0407055   -0.0385985    0.1069813
chapters     -0.1034605    0.0013019   -0.0401247    0.1027875
network      -0.0007803   -0.0305924   -0.0491171    0.1002902
labor        -0.0497055   -0.0406644   -0.0608924    0.0912014

coefs_nb %>% 
  arrange(desc(sps_theses)) %>% 
  head() %>% 
  knitr::kable(caption="NB SPS")
NB SPS

feature    eco_theses   hec_theses   law_theses   sps_theses
the         0.0563003    0.0645680    0.0695189    0.0651366
of          0.0318497    0.0393360    0.0441801    0.0412069
and         0.0208371    0.0291695    0.0262201    0.0313294
in          0.0227299    0.0209059    0.0187357    0.0234380
to          0.0183789    0.0161914    0.0197473    0.0207963
a           0.0168630    0.0141238    0.0146522    0.0141004

coefs_svm %>% 
  arrange(desc(sps_theses)) %>% 
  head() %>% 
  knitr::kable(caption="SVM SPS")
SVM SPS

feature          sps_theses   hec_theses   law_theses   eco_theses
why               0.1362991   -0.0494612   -0.0176162   -0.0599172
contemporary      0.1197083   -0.0704051   -0.0379703   -0.0408072
problem           0.1158665   -0.0701402   -0.0541148   -0.0212415
sociology         0.1093154   -0.0818114    0.0019105   -0.0072811
argues            0.1086896   -0.0463778   -0.0159006   -0.0353079
research          0.1083545   -0.0116346    0.0115430   -0.0823018

Try a substantive evaluation: Think about words that stand for

  • different methods
  • different formats (cumulative vs. monograph)

Have a look at the coefficients of these words for each department. Again, there are several ways to do this - you just need to select the correct row of the data frame.

# methods
coefs_nb[coefs_nb$feature=="data",]
##       eco_theses   hec_theses   law_theses   sps_theses feature
## data 0.001868209 8.559896e-05 0.0006516235 0.0009834122    data
coefs_nb[coefs_nb$feature=="regression",]
##              eco_theses   hec_theses   law_theses   sps_theses    feature
## regression 0.0001392963 6.584536e-06 6.205938e-06 6.748907e-05 regression
coefs_nb[coefs_nb$feature=="qualitative",]
##              eco_theses   hec_theses   law_theses   sps_theses     feature
## qualitative 3.27756e-05 3.950721e-05 5.585344e-05 0.0003760105 qualitative
coefs_nb[coefs_nb$feature=="narrative",]
##            eco_theses   hec_theses   law_theses   sps_theses   feature
## narrative 4.91634e-05 0.0002370433 0.0001055009 8.195101e-05 narrative
# formats
coefs_nb[coefs_nb$feature=="chapter",]
##          eco_theses  hec_theses   law_theses   sps_theses feature
## chapter 0.005883221 0.000454333 0.0008750372 0.0008098688 chapter
coefs_nb[coefs_nb$feature=="article",]
##          eco_theses   hec_theses   law_theses   sps_theses feature
## article 2.45817e-05 1.316907e-05 0.0004282097 9.159231e-05 article
# the same lookup, this time using dplyr::filter() instead of base subsetting
coefs_nb %>% filter(feature=="article")
##          eco_theses   hec_theses   law_theses   sps_theses feature
## article 2.45817e-05 1.316907e-05 0.0004282097 9.159231e-05 article

This was not part of your task, but one way to make the relationship between the coefficients clearer is to plot them. I do this here for the SVM coefficients since they are more intuitive to interpret. However, the NB coefficients also become easier to interpret once you focus on the differences between the departments.

coefs_svm %>%
  filter(feature %in% c("data","regression","qualitative","narrative",
                        "empirical","evidence","model", "measures")) %>%
  ggplot(aes(x=feature))+
  geom_point(aes(y=eco_theses,color="ECO"),alpha=0.7)+
  geom_point(aes(y=sps_theses,color="SPS"),alpha=0.7)+
  geom_point(aes(y=law_theses,color="LAW"),alpha=0.7)+
  geom_point(aes(y=hec_theses,color="HEC"),alpha=0.7)+
  ylab("coefficient")+
  geom_hline(aes(yintercept=0),linetype="dashed")+
  theme_minimal()+scale_color_discrete("Department")

Finally, think about keywords for your PhD: How predictive are they of each department? Do you fit where you are? :)

coefs_nb[coefs_nb$feature=="migration",]
##            eco_theses   hec_theses   law_theses   sps_theses   feature
## migration 0.000245817 0.0002172897 0.0001427366 0.0004338583 migration
coefs_nb[coefs_nb$feature=="democracy",]
##            eco_theses   hec_theses   law_theses  sps_theses   feature
## democracy 4.09695e-05 6.584536e-05 0.0002544435 0.000930385 democracy
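
You can run the same lookup on the SVM coefficients, where a positive weight pulls a prediction towards that department:

coefs_svm %>% filter(feature %in% c("migration", "democracy"))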