library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
We now have a look at the EUI Thesis abstracts to evaluate what differentiates departments.
Load the data and create a corpus.
You can filter all texts below a certain number of characters by using str_length()
or by checking if the abstract variable is NA - the data is a bit messy so I recommend this to get rid of data that was scraped incompletely or is not available in cadmus.
load("../data/theses.RData")
eui_corp <- theses %>%
dplyr::filter(str_length(abstract)>300) %>%
corpus(text_field="abstract")
Now, create a dfm and do whatever pre-processing you think might be useful.
eui_dfm <- tokens(eui_corp,
remove_punct=T,
remove_symbols=T,
split_hyphens=T) %>%
dfm()
# Note that you could of course do more extensive preprocessing
# For example, as you may see later on, there are non-english theses in the sample
# hence, you might try to find and delete these documents using frequent french, german and italian words
Using the department
document variable, train a Naive Bayes and a SVM model to predict the department.
If you want, skip the train-test split here because we don’t need to see the accuracy of our model.
model_nb <- textmodel_nb(eui_dfm,docvars(eui_dfm,"department"))
model_svm <- textmodel_svm(eui_dfm,docvars(eui_dfm,"department"))
You can use coef()
on the model to get a matrix of coefficients per feature and ‘class’.
One thing that is a bit tricky is that both matrixes are different: for the svm coefficients, the departments will be the first dimension, for the naive bayes model, departments are the second dimension.
This is a bit tricky, so I include the code. I recommend
t()
(that is, making the columns into rows and vice-versa)coefs_svm <- coef(model_svm) %>%
t() %>%
data.frame()%>%
mutate(feature=rownames(.))
coefs_nb <- coef(model_nb) %>%
data.frame() %>%
mutate(feature=rownames(.))
Now, try to find the most predictive features for two of the departments. You can do this for example by using arrange()
on the dataframe - but there are other solutions, this is just a data management task.
You should see a difference between the results for both classifiers that is due to the way they reach their predictions building on the feature scores.
coefs_nb %>%
arrange(desc(eco_theses)) %>%
head() %>%
# no need for this line - it just makes the results nicer to read in rmarkdown
knitr::kable(caption="NB Economy")
eco_theses | hec_theses | law_theses | sps_theses | feature | |
---|---|---|---|---|---|
the | 0.0563003 | 0.0645680 | 0.0695189 | 0.0651366 | the |
of | 0.0318497 | 0.0393360 | 0.0441801 | 0.0412069 | of |
in | 0.0227299 | 0.0209059 | 0.0187357 | 0.0234380 | in |
and | 0.0208371 | 0.0291695 | 0.0262201 | 0.0313294 | and |
to | 0.0183789 | 0.0161914 | 0.0197473 | 0.0207963 | to |
a | 0.0168630 | 0.0141238 | 0.0146522 | 0.0141004 | a |
coefs_svm %>%
arrange(desc(eco_theses)) %>%
head() %>%
knitr::kable(caption="SVM Economy")
sps_theses | hec_theses | law_theses | eco_theses | feature | |
---|---|---|---|---|---|
chapter | -0.1041301 | -0.0413539 | -0.0014945 | 0.1187724 | chapter |
models | -0.1054706 | -0.0716182 | -0.0463843 | 0.1074100 | models |
economy | -0.0514300 | -0.0407055 | -0.0385985 | 0.1069813 | economy |
chapters | -0.1034605 | 0.0013019 | -0.0401247 | 0.1027875 | chapters |
network | -0.0007803 | -0.0305924 | -0.0491171 | 0.1002902 | network |
labor | -0.0497055 | -0.0406644 | -0.0608924 | 0.0912014 | labor |
coefs_nb %>%
arrange(desc(sps_theses)) %>%
head() %>%
knitr::kable(caption="NB SPS")
eco_theses | hec_theses | law_theses | sps_theses | feature | |
---|---|---|---|---|---|
the | 0.0563003 | 0.0645680 | 0.0695189 | 0.0651366 | the |
of | 0.0318497 | 0.0393360 | 0.0441801 | 0.0412069 | of |
and | 0.0208371 | 0.0291695 | 0.0262201 | 0.0313294 | and |
in | 0.0227299 | 0.0209059 | 0.0187357 | 0.0234380 | in |
to | 0.0183789 | 0.0161914 | 0.0197473 | 0.0207963 | to |
a | 0.0168630 | 0.0141238 | 0.0146522 | 0.0141004 | a |
coefs_svm %>%
arrange(desc(sps_theses)) %>%
head() %>%
knitr::kable(caption="SVM SPS")
sps_theses | hec_theses | law_theses | eco_theses | feature | |
---|---|---|---|---|---|
why | 0.1362991 | -0.0494612 | -0.0176162 | -0.0599172 | why |
contemporary | 0.1197083 | -0.0704051 | -0.0379703 | -0.0408072 | contemporary |
problem | 0.1158665 | -0.0701402 | -0.0541148 | -0.0212415 | problem |
sociology | 0.1093154 | -0.0818114 | 0.0019105 | -0.0072811 | sociology |
argues | 0.1086896 | -0.0463778 | -0.0159006 | -0.0353079 | argues |
research | 0.1083545 | -0.0116346 | 0.0115430 | -0.0823018 | research |
Try a substantive evaluation: Think about words that stand for
Have a look at the coefficients of these words for each department. Again, there are several ways to do this - you just need to select the correct row of the data frame.
# methods
coefs_nb[coefs_nb$feature=="data",]
## eco_theses hec_theses law_theses sps_theses feature
## data 0.001868209 8.559896e-05 0.0006516235 0.0009834122 data
coefs_nb[coefs_nb$feature=="regression",]
## eco_theses hec_theses law_theses sps_theses feature
## regression 0.0001392963 6.584536e-06 6.205938e-06 6.748907e-05 regression
coefs_nb[coefs_nb$feature=="qualitative",]
## eco_theses hec_theses law_theses sps_theses feature
## qualitative 3.27756e-05 3.950721e-05 5.585344e-05 0.0003760105 qualitative
coefs_nb[coefs_nb$feature=="narrative",]
## eco_theses hec_theses law_theses sps_theses feature
## narrative 4.91634e-05 0.0002370433 0.0001055009 8.195101e-05 narrative
# formats
coefs_nb[coefs_nb$feature=="chapter",]
## eco_theses hec_theses law_theses sps_theses feature
## chapter 0.005883221 0.000454333 0.0008750372 0.0008098688 chapter
coefs_nb[coefs_nb$feature=="article",]
## eco_theses hec_theses law_theses sps_theses feature
## article 2.45817e-05 1.316907e-05 0.0004282097 9.159231e-05 article
coefs_nb %>% filter(feature=="article")
## eco_theses hec_theses law_theses sps_theses feature
## article 2.45817e-05 1.316907e-05 0.0004282097 9.159231e-05 article
This was not part of your task - but one way to get the relation of the coefficients clearer would be to plot them. I do this here for the SVM coefficients since they are more intuitive and useful to understand. However, the NB coefficients also become easier to interpret when you focus on the difference between the departments.
coefs_svm %>%
filter(feature %in% c("data","regression","qualitative","narrative",
"empirical","evidence","model", "measures")) %>%
ggplot(aes(x=feature))+
geom_point(aes(y=eco_theses,color="ECO"),alpha=0.7)+
geom_point(aes(y=sps_theses,color="SPS"),alpha=0.7)+
geom_point(aes(y=law_theses,color="LAW"),alpha=0.7)+
geom_point(aes(y=hec_theses,color="HEC"),alpha=0.7)+
ylab("coefficient")+
geom_hline(aes(yintercept=0),linetype="dashed")+
theme_minimal()+scale_color_discrete("Department")
Finally, think about keywords for your PhD: How predictive are they of each department? Do you fit where you are? :)
coefs_nb[coefs_nb$feature=="migration",]
## eco_theses hec_theses law_theses sps_theses feature
## migration 0.000245817 0.0002172897 0.0001427366 0.0004338583 migration
coefs_nb[coefs_nb$feature=="democracy",]
## eco_theses hec_theses law_theses sps_theses feature
## democracy 4.09695e-05 6.584536e-05 0.0002544435 0.000930385 democracy