```{r setup}
library(tidyverse)
library(quanteda)
library(quanteda.textmodels)
```

## Preparations

We now look at the EUI thesis abstracts to evaluate what differentiates the departments.

Load the data and create a corpus. The data is a bit messy, so I recommend removing abstracts that were scraped incompletely or are not available in Cadmus. You can do this by filtering out all texts below a certain number of characters with `str_length()`, or by checking whether the abstract variable is `NA`.

```{r}

```

Now, create a dfm and do whatever pre-processing you think might be useful.

```{r}

```

## Training the classifier

Using the `department` document variable, train a Naive Bayes and an SVM model to predict the department. You can skip the train-test split here, because we do not need to know the accuracy of our model.

```{r}

```

## Feature evaluation

You can use `coef()` on a model to get a matrix of coefficients per feature and class. The two matrices are oriented differently: for the SVM coefficients, the departments are the first dimension (rows), while for the Naive Bayes model they are the second dimension (columns). Since this is a bit tricky, I include the code. I recommend

- transposing the matrix of the SVM coefficients using `t()` (that is, turning the columns into rows and vice versa)
- storing both matrices as data frames
- storing the row names in a new variable called `feature`

```{r}
# SVM: departments are in the rows, so transpose to get one row per feature
coefs_svm <- coef(model_svm) %>%
  t() %>%
  data.frame() %>%
  mutate(feature = rownames(.))

# Naive Bayes: features are already in the rows
coefs_nb <- coef(model_nb) %>%
  data.frame() %>%
  mutate(feature = rownames(.))
```

Now, try to find the most predictive features for two of the departments. You can do this, for example, by using `arrange()` on the data frame, but there are other solutions; this is just a data management task. You should see a difference between the results of the two classifiers, which is due to the way each of them builds its predictions on the feature scores.
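
As a minimal sketch of the sorting step, the block below uses a small made-up data frame with the same shape as `coefs_svm` and `coefs_nb` (a `feature` column plus one column per department). The name `coefs_toy`, the department columns `dept_a` and `dept_b`, and all values are assumptions for illustration only; they are not taken from the thesis data.

```r
library(dplyr)

# Toy stand-in for coefs_svm / coefs_nb: one row per feature, one column per
# department. Column names and values are invented for illustration.
coefs_toy <- data.frame(
  feature = c("regression", "archive", "court", "equilibrium"),
  dept_a  = c(0.80, -0.30, -0.10, 0.20),
  dept_b  = c(-0.40, 0.90, 0.05, -0.60),
  stringsAsFactors = FALSE
)

# Most predictive features for dept_a: sort by its column in descending
# order and keep the two highest-scoring rows.
top_dept_a <- coefs_toy %>%
  arrange(desc(dept_a)) %>%
  slice_head(n = 2) %>%
  select(feature, dept_a)

top_dept_a
```

The same pattern should carry over to the real data frames by swapping in `coefs_svm` or `coefs_nb` and an actual department column.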
```{r}

```

Try a substantive evaluation. Think about words that stand for

- different methods
- different formats (cumulative vs. monograph)

Have a look at the coefficients of these words for each department. Again, there are several ways to do this; you just need to select the correct row of the data frame.

```{r}

```

Finally, think about keywords for your PhD: How predictive are they of each department? Do you fit where you are? :)

```{r}

```
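
Selecting the rows for a handful of words can be sketched as follows. This is one possible approach, shown on a made-up data frame shaped like `coefs_svm` and `coefs_nb`; `coefs_toy`, the `keywords` vector, the department columns and all values are hypothetical.

```r
library(dplyr)

# Toy stand-in for a coefficient data frame; names and values are invented.
coefs_toy <- data.frame(
  feature = c("regression", "interview", "chapter", "paper"),
  dept_a  = c(0.70, -0.20, -0.50, 0.60),
  dept_b  = c(-0.30, 0.40, 0.80, -0.10),
  stringsAsFactors = FALSE
)

# Words of interest. "ethnography" is deliberately not in the toy vocabulary.
keywords <- c("regression", "paper", "ethnography")

# Keep only the rows whose feature is one of the keywords.
keyword_coefs <- coefs_toy %>%
  filter(feature %in% keywords)

keyword_coefs

# Keywords without a row, e.g. because they were trimmed or stemmed away
# during pre-processing of the dfm.
setdiff(keywords, keyword_coefs$feature)
```

Note that `filter()` silently drops keywords that are not among the features, so the `setdiff()` check helps to spot words that need a different spelling or stem.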