Substantive Features - EUI Thesis Abstracts

library(tidyverse)
library(quanteda)
library(quanteda.textmodels)

Preparations

We now have a look at the EUI Thesis abstracts to evaluate what differentiates departments.

Load the data and create a corpus.

You can filter all texts below a certain number of characters by using str_length() or by checking if the abstract variable is NA - the data is a bit messy so I recommend this to get rid of data that was scraped incompletely or is not available in cadmus.

load("../data/theses.RData")

eui_corp <- theses %>%
  dplyr::filter(str_length(abstract)>300) %>% 
  corpus(text_field="abstract")

Now, create a dfm and do whatever pre-processing you think might be useful.

eui_dfm <- tokens(eui_corp,
                  remove_punct=T,
                  remove_symbols=T,
                  split_hyphens=T) %>%
  dfm()

# Note that you could of course do more extensive preprocessing
# For example, as you may see later on, there are non-english theses in the sample
# hence, you might try to find and delete these documents using frequent french, german and italian words

Training the classifier

Using the department document variable, train a Naive Bayes and a SVM model to predict the department.

If you want, skip the train-test split here because we don’t need to see the accuracy of our model.

model_nb <- textmodel_nb(eui_dfm,docvars(eui_dfm,"department"))
model_svm <- textmodel_svm(eui_dfm,docvars(eui_dfm,"department"))

Feature evaluation

You can use coef() on the model to get a matrix of coefficients per feature and ‘class’.

One thing that is a bit tricky is that both matrixes are different: for the svm coefficients, the departments will be the first dimension, for the naive bayes model, departments are the second dimension.

This is a bit tricky, so I include the code. I recommend

transposing the matrix of the svm coefficients using t()(that is, making the columns into rows and vice-versa)
storing both matrixes as a data frame
storing the rownames in a new variable called feature

coefs_svm <- coef(model_svm) %>%
  t() %>% 
  data.frame()%>%
  mutate(feature=rownames(.))

coefs_nb <- coef(model_nb) %>% 
  data.frame() %>%
  mutate(feature=rownames(.))

Now, try to find the most predictive features for two of the departments. You can do this for example by using arrange() on the dataframe - but there are other solutions, this is just a data management task.

You should see a difference between the results for both classifiers that is due to the way they reach their predictions building on the feature scores.

coefs_nb %>% 
  arrange(desc(eco_theses)) %>% 
  head() %>% 
  # no need for this line - it just makes the results nicer to read in rmarkdown
  knitr::kable(caption="NB Economy")

NB Economy
	eco_theses	hec_theses	law_theses	sps_theses	feature
the	0.0563003	0.0645680	0.0695189	0.0651366	the
of	0.0318497	0.0393360	0.0441801	0.0412069	of
in	0.0227299	0.0209059	0.0187357	0.0234380	in
and	0.0208371	0.0291695	0.0262201	0.0313294	and
to	0.0183789	0.0161914	0.0197473	0.0207963	to
a	0.0168630	0.0141238	0.0146522	0.0141004	a

coefs_svm %>% 
  arrange(desc(eco_theses)) %>% 
  head() %>% 
  knitr::kable(caption="SVM Economy")

SVM Economy
	sps_theses	hec_theses	law_theses	eco_theses	feature
chapter	-0.1041301	-0.0413539	-0.0014945	0.1187724	chapter
models	-0.1054706	-0.0716182	-0.0463843	0.1074100	models
economy	-0.0514300	-0.0407055	-0.0385985	0.1069813	economy
chapters	-0.1034605	0.0013019	-0.0401247	0.1027875	chapters
network	-0.0007803	-0.0305924	-0.0491171	0.1002902	network
labor	-0.0497055	-0.0406644	-0.0608924	0.0912014	labor

coefs_nb %>% 
  arrange(desc(sps_theses)) %>% 
  head() %>% 
  knitr::kable(caption="NB SPS")

NB SPS
	eco_theses	hec_theses	law_theses	sps_theses	feature
the	0.0563003	0.0645680	0.0695189	0.0651366	the
of	0.0318497	0.0393360	0.0441801	0.0412069	of
and	0.0208371	0.0291695	0.0262201	0.0313294	and
in	0.0227299	0.0209059	0.0187357	0.0234380	in
to	0.0183789	0.0161914	0.0197473	0.0207963	to
a	0.0168630	0.0141238	0.0146522	0.0141004	a

coefs_svm %>% 
  arrange(desc(sps_theses)) %>% 
  head() %>% 
  knitr::kable(caption="SVM SPS")

SVM SPS
	sps_theses	hec_theses	law_theses	eco_theses	feature
why	0.1362991	-0.0494612	-0.0176162	-0.0599172	why
contemporary	0.1197083	-0.0704051	-0.0379703	-0.0408072	contemporary
problem	0.1158665	-0.0701402	-0.0541148	-0.0212415	problem
sociology	0.1093154	-0.0818114	0.0019105	-0.0072811	sociology
argues	0.1086896	-0.0463778	-0.0159006	-0.0353079	argues
research	0.1083545	-0.0116346	0.0115430	-0.0823018	research

Try a substantive evaluation: Think about words that stand for

different methods
different formats (cumulative vs. monograph)

Have a look at the coefficients of these words for each department. Again, there are several ways to do this - you just need to select the correct row of the data frame.

# methods
coefs_nb[coefs_nb$feature=="data",]

##       eco_theses   hec_theses   law_theses   sps_theses feature
## data 0.001868209 8.559896e-05 0.0006516235 0.0009834122    data

coefs_nb[coefs_nb$feature=="regression",]

##              eco_theses   hec_theses   law_theses   sps_theses    feature
## regression 0.0001392963 6.584536e-06 6.205938e-06 6.748907e-05 regression

coefs_nb[coefs_nb$feature=="qualitative",]

##              eco_theses   hec_theses   law_theses   sps_theses     feature
## qualitative 3.27756e-05 3.950721e-05 5.585344e-05 0.0003760105 qualitative

coefs_nb[coefs_nb$feature=="narrative",]

##            eco_theses   hec_theses   law_theses   sps_theses   feature
## narrative 4.91634e-05 0.0002370433 0.0001055009 8.195101e-05 narrative

# formats
coefs_nb[coefs_nb$feature=="chapter",]

##          eco_theses  hec_theses   law_theses   sps_theses feature
## chapter 0.005883221 0.000454333 0.0008750372 0.0008098688 chapter

coefs_nb[coefs_nb$feature=="article",]

##          eco_theses   hec_theses   law_theses   sps_theses feature
## article 2.45817e-05 1.316907e-05 0.0004282097 9.159231e-05 article

coefs_nb %>% filter(feature=="article")

##          eco_theses   hec_theses   law_theses   sps_theses feature
## article 2.45817e-05 1.316907e-05 0.0004282097 9.159231e-05 article

This was not part of your task - but one way to get the relation of the coefficients clearer would be to plot them. I do this here for the SVM coefficients since they are more intuitive and useful to understand. However, the NB coefficients also become easier to interpret when you focus on the difference between the departments.

coefs_svm %>%
  filter(feature %in% c("data","regression","qualitative","narrative",
                        "empirical","evidence","model", "measures")) %>%
  ggplot(aes(x=feature))+
  geom_point(aes(y=eco_theses,color="ECO"),alpha=0.7)+
  geom_point(aes(y=sps_theses,color="SPS"),alpha=0.7)+
  geom_point(aes(y=law_theses,color="LAW"),alpha=0.7)+
  geom_point(aes(y=hec_theses,color="HEC"),alpha=0.7)+
  ylab("coefficient")+
  geom_hline(aes(yintercept=0),linetype="dashed")+
  theme_minimal()+scale_color_discrete("Department")

Finally, think about keywords for your PhD: How predictive are they of each department? Do you fit where you are? :)

coefs_nb[coefs_nb$feature=="migration",]

##            eco_theses   hec_theses   law_theses   sps_theses   feature
## migration 0.000245817 0.0002172897 0.0001427366 0.0004338583 migration

coefs_nb[coefs_nb$feature=="democracy",]

##            eco_theses   hec_theses   law_theses  sps_theses   feature
## democracy 4.09695e-05 6.584536e-05 0.0002544435 0.000930385 democracy

Substantive Features - EUI Thesis Abstracts

Session 3

Theresa Gessler

2022-05-09

Preparations

Training the classifier

Feature evaluation