```{r setup}
library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
```
## Preparations
We now have a look at the EUI Thesis abstracts to evaluate what differentiates departments.
Load the data (it is UTF-8-encoded, in case you run into encoding issues) and create a corpus.
I recommend that you drop all texts below a certain number of characters by using `str_length()` (you can choose the threshold yourself, e.g. 300 characters). The data is a bit messy, so this helps to get rid of abstracts that were scraped incompletely.
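As a minimal sketch of the filtering step, the following uses a small invented data frame in place of the real abstracts. The column names `abstract` and `department` and the object names `theses`, `min_chars` and `corp` are assumptions for illustration; adjust them to whatever the actual file contains.

```r
library(tidyverse)
library(quanteda)

# Toy stand-in for the abstracts data (invented texts, hypothetical column names)
theses <- tibble(
  abstract = c(
    strrep("This thesis studies voting behaviour in European elections. ", 8),
    "Incomplete scrape",
    strrep("This thesis examines the case law of the Court of Justice. ", 8)
  ),
  department = c("SPS", "ECO", "LAW")
)

# Threshold in characters (str_length() counts characters, not words)
min_chars <- 300

# Keep only abstracts that are at least `min_chars` characters long
theses_clean <- theses %>%
  filter(str_length(abstract) >= min_chars)

# Build the corpus; all remaining columns become document variables
corp <- corpus(theses_clean, text_field = "abstract")

ndoc(corp)
docvars(corp, "department")
```

On the toy data, the short second text is dropped and the two long abstracts remain.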
```{r}
```
Now, create a dfm and do whatever pre-processing you think might be useful.
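One possible pre-processing pipeline is sketched below on two invented sentences. Which steps are useful is a judgment call; removing punctuation, numbers and English stopwords is only one common choice, not necessarily the best one for this task. The object names `corp`, `toks` and `dfmat` are placeholders.

```r
library(quanteda)

# Two invented example documents
corp <- corpus(c(
  d1 = "We estimate a regression model of voting behaviour in 27 countries.",
  d2 = "The thesis analyses the case law of the Court of Justice."
))

# Tokenise and drop punctuation, numbers and symbols
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE)

# Lowercase and remove English stopwords
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Build the document-feature matrix
dfmat <- dfm(toks)

# On real data, rare features can be trimmed; min_termfreq = 1 keeps everything here
dfmat <- dfm_trim(dfmat, min_termfreq = 1)

featnames(dfmat)
```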
```{r}
```
## Training the classifier
Using the `department` document variable, train a Naive Bayes and an SVM model to predict the department.
If you want, skip the train-test split here because we don't need to see the accuracy of our model.
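A minimal sketch of the training step, assuming a dfm with a `department` document variable. The six texts and their labels are invented so that the example runs on its own; the model names `model_nb` and `model_svm` match the ones used in the feature evaluation code.

```r
library(quanteda)
library(quanteda.textmodels)

# Toy data; texts and department labels are invented for illustration
toy <- data.frame(
  text = c(
    "regression model of inflation and monetary policy",
    "panel data estimation of labour market wages",
    "case law of the court and treaty interpretation",
    "constitutional rights and judicial review in courts",
    "archival sources on medieval trade networks",
    "oral history and memory in postwar europe"
  ),
  department = c("ECO", "ECO", "LAW", "LAW", "HEC", "HEC")
)

corp <- corpus(toy, text_field = "text")
dfmat <- dfm(tokens(corp))

# Training labels taken from the document variable
dept <- factor(docvars(dfmat, "department"))

# Fit both classifiers on the full dfm (no train-test split)
model_nb <- textmodel_nb(dfmat, y = dept)
model_svm <- textmodel_svm(dfmat, y = dept)

# Compare the orientation of the two coefficient matrices
dim(coef(model_nb))
dim(coef(model_svm))
```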
```{r}
```
## Feature evaluation
You can use `coef()` on the model to get a matrix of coefficients per feature and 'class'.
One complication is that the two matrices are oriented differently: for the SVM coefficients, the departments are the first dimension (rows), whereas for the Naive Bayes model they are the second dimension (columns).
Because this is easy to get wrong, I include the code. I recommend:
- transposing the matrix of the SVM coefficients using `t()` (that is, turning the columns into rows and vice versa)
- storing both matrices as data frames
- storing the row names in a new variable called `feature`
```{r}
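# SVM: coef() is departments x features, so transpose to features x departments
coefs_svm <- coef(model_svm) %>%
  t() %>%
  data.frame() %>%
  mutate(feature = rownames(.))

# Naive Bayes: coef() is already features x departments
coefs_nb <- coef(model_nb) %>%
  data.frame() %>%
  mutate(feature = rownames(.))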
```
Now, try to find the most predictive features for two of the departments. You can do this, for example, by using `arrange()` on the data frame, but there are other solutions; this is just a data management task.
You should see a difference between the results of the two classifiers. It stems from how each model turns its feature scores into predictions.
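The sorting step could look like the sketch below. It uses a small invented table, `coefs_toy`, that only mimics the shape of the coefficient data frames (one `feature` column plus one column per department); the values and department columns are made up.

```r
library(tidyverse)

# Hypothetical coefficient table in the same shape as the real data frames
coefs_toy <- tibble(
  feature = c("regression", "court", "archive", "treaty"),
  ECO = c(0.90, 0.10, 0.05, 0.02),
  LAW = c(0.10, 0.80, 0.10, 0.70)
)

# Highest-scoring features for one department
top_eco <- coefs_toy %>%
  arrange(desc(ECO)) %>%
  select(feature, ECO) %>%
  head(2)

# Same idea for a second department
top_law <- coefs_toy %>%
  arrange(desc(LAW)) %>%
  select(feature, LAW) %>%
  head(2)

top_eco
top_law
```

The same calls should work on the Naive Bayes and SVM data frames once `coefs_toy` and the column names are replaced by the real ones.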
```{r}
```
Try a substantive evaluation: Think about words that stand for
- different methods
- different formats (cumulative vs. monograph)
Have a look at the coefficients of these words for each department. Again, there are several ways to do this - you just need to select the correct row of the data frame.
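Selecting the rows for a handful of words can be sketched with `filter()`. The table `coefs_toy` and the word list `my_words` are again invented; words that do not occur in the feature set are simply not returned.

```r
library(tidyverse)

# Hypothetical coefficient table (invented values and departments)
coefs_toy <- tibble(
  feature = c("regression", "interviews", "monograph", "articles"),
  ECO = c(0.90, 0.05, 0.02, 0.40),
  HEC = c(0.05, 0.30, 0.60, 0.05)
)

# Words of interest; "ethnography" is not in the toy features
my_words <- c("regression", "monograph", "ethnography")

# Keep only the rows for the selected words
selected <- coefs_toy %>%
  filter(feature %in% my_words)

selected
```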
```{r}
```
Finally, think about keywords for your PhD: How predictive are they of each department? Do you fit where you are? :)
```{r}
```