```{r setup}
library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
```

## Preparations

We now look at the EUI thesis abstracts to see what differentiates the departments.

Load the data and create a corpus.

The data is a bit messy, so I recommend dropping abstracts that were scraped incompletely or are not available in Cadmus. You can do this by filtering out all texts below a certain number of characters with `str_length()`, or by checking whether the abstract variable is `NA`.

```{r}


```
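
As a rough illustration of the filtering step, here is a minimal sketch on made-up data. The tibble `toy_theses`, its column names and the 50-character threshold are assumptions for the example; the real file may look different.

```r
library(tidyverse)
library(quanteda)

# Hypothetical toy data standing in for the scraped abstracts;
# column names and contents are invented for this sketch.
toy_theses <- tibble(
  department = c("SPS", "LAW", "ECO", "HEC", "SPS"),
  abstract = c(
    "This thesis uses survey data and regression models to study voting behaviour in European democracies.",
    "This thesis examines the case law of the European Court of Justice on fundamental rights protection.",
    NA,
    "n/a",
    "Too short"
  )
)

# Drop missing abstracts and abstracts shorter than an (arbitrary) 50 characters
toy_clean <- toy_theses %>%
  filter(!is.na(abstract), str_length(abstract) >= 50)

# Build the corpus; the remaining columns become document variables
toy_corpus <- corpus(toy_clean, text_field = "abstract")

ndoc(toy_corpus)
docvars(toy_corpus, "department")
```

The threshold is a judgment call: it is worth looking at a few of the shortest abstracts before settling on a value.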

Now, create a dfm and do whatever pre-processing you think might be useful.

```{r}


```
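
One possible pre-processing pipeline, sketched on two invented sentences. Which steps are useful is up to you; the ones below (removing punctuation, numbers and English stopwords, then trimming rare terms) are only a common starting point, and the `min_termfreq` value is an arbitrary choice for the toy data.

```r
library(quanteda)

# Hypothetical toy corpus; replace with the corpus of abstracts
toy_corpus <- corpus(c(
  d1 = "The thesis estimates 3 regression models of voting, and the models fit well.",
  d2 = "The thesis analyses the case law of the court, and the court reasoning is examined."
))

# Tokenise and drop punctuation and numbers
toy_tokens <- tokens(toy_corpus, remove_punct = TRUE, remove_numbers = TRUE)

# Remove English stopwords
toy_tokens <- tokens_remove(toy_tokens, stopwords("en"))

# Build the document-feature matrix (lowercases by default)
toy_dfm <- dfm(toy_tokens)

# Keep only features that occur at least twice in the whole corpus
toy_dfm <- dfm_trim(toy_dfm, min_termfreq = 2)

featnames(toy_dfm)
```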

## Training the classifier


Using the `department` document variable, train a Naive Bayes and an SVM model to predict the department.

You can skip the train-test split here, since we don't need to evaluate the accuracy of our model.

```{r}


```
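
A minimal sketch of the training step on an invented six-document corpus, assuming the default settings of `textmodel_nb()` and `textmodel_svm()`. The `toy_` objects are placeholders; with the real data you would pass your own dfm and its `department` docvar.

```r
library(quanteda)
library(quanteda.textmodels)

# Hypothetical toy corpus with a department docvar (texts are invented)
toy_corpus <- corpus(
  c("regression survey voting parties election",
    "survey election parties turnout regression",
    "court law rights treaty judgment",
    "treaty court judgment law directive",
    "inflation market equilibrium prices model",
    "market prices equilibrium model growth"),
  docvars = data.frame(
    department = c("SPS", "SPS", "LAW", "LAW", "ECO", "ECO")
  )
)
toy_dfm <- dfm(tokens(toy_corpus))

# Train both classifiers on the full dfm (no train-test split)
toy_nb  <- textmodel_nb(toy_dfm, y = docvars(toy_dfm, "department"))
toy_svm <- textmodel_svm(toy_dfm, y = docvars(toy_dfm, "department"))

# Compare the shapes of the two coefficient matrices
dim(coef(toy_nb))
dim(coef(toy_svm))
```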

## Feature evaluation

You can use `coef()` on the model to get a matrix of coefficients per feature and 'class'.

One thing that is a bit tricky is that the two matrices are oriented differently: for the SVM coefficients, the departments are the first dimension (rows); for the Naive Bayes model, the departments are the second dimension (columns).

Because of this, I include the code. I recommend:

- transposing the matrix of the SVM coefficients using `t()` (that is, turning the columns into rows and vice versa)
- storing both matrices as data frames
- storing the row names in a new variable called `feature`

```{r}
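# SVM: departments are rows, so transpose first to get one row per feature
coefs_svm <- coef(model_svm) %>%
  t() %>%
  data.frame() %>%
  mutate(feature = rownames(.))

# Naive Bayes: features are already rows, so no transposition is needed
coefs_nb <- coef(model_nb) %>%
  data.frame() %>%
  mutate(feature = rownames(.))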
```

Now, try to find the most predictive features for two of the departments. You can do this, for example, by using `arrange()` on the data frame. There are other solutions; this is just a data management task.

You should see a difference between the results of the two classifiers. It stems from the way each model uses its feature scores to reach a prediction.

```{r}


```
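
To illustrate the `arrange()` approach, here is a sketch on a small made-up coefficient table. `toy_coefs` and all its numbers are invented; with the real data you would use `coefs_svm` or `coefs_nb` instead.

```r
library(tidyverse)

# Hypothetical coefficient table: one row per feature, one column per department
toy_coefs <- tibble(
  feature = c("regression", "court", "market", "survey", "treaty"),
  SPS = c(0.8, -0.5, -0.1, 0.9, -0.3),
  LAW = c(-0.6, 0.9, -0.2, -0.4, 0.7),
  ECO = c(0.3, -0.4, 0.9, -0.2, -0.5)
)

# Sort by the SPS column in descending order and keep the top three features
top_sps <- toy_coefs %>%
  arrange(desc(SPS)) %>%
  select(feature, SPS) %>%
  head(3)

top_sps
```

`slice_max(toy_coefs, SPS, n = 3)` is an equivalent shortcut if you prefer a single call.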


Try a substantive evaluation: Think about words that stand for

- different methods
- different formats (cumulative vs. monograph)

Have a look at the coefficients of these words for each department. Again, there are several ways to do this; you just need to select the correct rows of the data frame.

```{r}


```
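
Selecting the rows for a handful of words could look like the sketch below. The table `toy_coefs`, its values and the word list `method_words` are all invented for the example; note that words which do not appear as features are silently dropped by `filter()`, so it is worth checking which ones are missing.

```r
library(tidyverse)

# Hypothetical coefficient table: one row per feature, one column per department
toy_coefs <- tibble(
  feature = c("regression", "interviews", "court", "market"),
  SPS = c(0.8, 0.6, -0.5, -0.1),
  LAW = c(-0.6, -0.1, 0.9, -0.2),
  ECO = c(0.3, -0.7, -0.4, 0.9)
)

# Words of interest (here: method-related terms chosen for illustration)
method_words <- c("regression", "interviews", "archival")

# Keep only the rows whose feature is one of the words of interest
method_coefs <- toy_coefs %>%
  filter(feature %in% method_words)

method_coefs

# Words from the list that are not features in the table
missing_words <- setdiff(method_words, toy_coefs$feature)
missing_words
```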


Finally, think about keywords for your PhD: How predictive are they of each department? Do you fit where you are? :)


```{r}


```