Over the past years, the digitization of legal, political, and journalistic corpora, together with the growth of online sources, has allowed researchers to address new questions across the political and social sciences. This has spurred the development of new methods for analyzing text as data.
The aim of this course is to introduce students to the quantitative analysis of textual data. We will cover the theoretical underpinnings of text-as-data approaches, implement these methods in hands-on exercises using the R statistical programming language, and discuss examples of recent empirical research that uses these techniques.
The course will cover the collection of text data with web scraping techniques, text preprocessing, dictionaries and descriptive analysis of texts, as well as supervised and unsupervised learning methods to classify the content of text corpora.
Except for the first and last class, all classes are held Tuesday 2pm-4pm.
Accompanying Lab sessions are held Friday 2pm-3pm.
Tasks should be uploaded by 12pm on the Friday of the lab session via Dropbox File Request.
This first, introductory session outlines what the course covers and collects information about your previous experience.
This session covers how to collect data from web pages. We will discuss basic scraping techniques and give an overview of the more advanced techniques that are available.
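As a rough illustration of basic scraping, the following minimal sketch uses rvest to pull structured text out of HTML. The page content, the `div.speech` class and the column names are hypothetical and inlined so that the example runs without a network connection; for a live page you would pass a URL to `read_html()` instead.

```r
library(rvest)

# Hypothetical page source (an assumption for illustration, not a real site)
page_source <- '
<html><body>
  <div class="speech"><h2>Budget debate</h2><p>We must invest in schools.</p></div>
  <div class="speech"><h2>Health debate</h2><p>Hospitals need more staff.</p></div>
</body></html>'

# Parse the HTML; read_html() also accepts a URL or a file path
page <- read_html(page_source)

# Select every speech block with a CSS selector
speeches <- html_elements(page, "div.speech")

# Extract one title and one text per speech block into a data frame
scraped <- data.frame(
  title = html_text2(html_element(speeches, "h2")),
  text  = html_text2(html_element(speeches, "p"))
)

print(scraped)
```

Selecting the enclosing block first and then extracting one element per block keeps titles and texts aligned, even if a block happens to lack one of the two.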
This session dives into text analysis: you will learn how to describe whole collections of text and how to apply dictionaries to learn more about them.
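To give a flavor of what this looks like in practice, here is a minimal sketch with quanteda. The three example texts and the two-category dictionary are made up for illustration and are not course material.

```r
library(quanteda)

# Hypothetical toy documents
texts <- c(doc1 = "Taxes are too high and the economy is weak.",
           doc2 = "Our schools and hospitals need more funding.",
           doc3 = "Lower taxes will help the economy grow.")

# Tokenize, drop punctuation and English stopwords, build a document-feature matrix
toks  <- tokens(corpus(texts), remove_punct = TRUE)
toks  <- tokens_remove(toks, stopwords("en"))
dfmat <- dfm(toks)

# Descriptive analysis: the most frequent features across the collection
print(topfeatures(dfmat, n = 5))

# A hypothetical dictionary; "*" is a glob wildcard matching any word ending
topic_dict <- dictionary(list(
  economy = c("tax*", "econom*"),
  welfare = c("school*", "hospital*")
))

# Count dictionary matches per document
dict_counts <- dfm_lookup(dfmat, topic_dict)
print(dict_counts)
```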
This session will cover classification of texts into known categories with previously coded training data.
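The idea can be sketched with a hand-written multinomial Naive Bayes classifier on top of a quanteda document-feature matrix. This is only one possible approach, chosen here because it needs nothing beyond base R and quanteda; it is not necessarily the method used in class, and the training texts and labels are invented for illustration.

```r
library(quanteda)

# Hypothetical hand-coded training data and unlabeled test texts
train_texts  <- c("lower taxes boost growth", "cut taxes for business",
                  "fund public hospitals", "more nurses for hospitals")
train_labels <- factor(c("economy", "economy", "health", "health"))
test_texts   <- c("taxes hurt growth", "hospitals need nurses")

train_dfm <- dfm(tokens(train_texts))
# Restrict the test documents to the training vocabulary (unseen words are dropped)
test_dfm  <- dfm_match(dfm(tokens(test_texts)), features = featnames(train_dfm))

X <- as.matrix(train_dfm)

# Total count of each word within each class
class_counts <- rowsum(X, group = train_labels)

# Word log-likelihoods per class with add-one (Laplace) smoothing
log_lik   <- log((class_counts + 1) / (rowSums(class_counts) + ncol(X)))
log_prior <- log(table(train_labels) / length(train_labels))

# Score each test document under each class and pick the best-scoring class
scores    <- as.matrix(test_dfm) %*% t(log_lik)
scores    <- sweep(scores, 2, as.numeric(log_prior[colnames(scores)]), "+")
predicted <- colnames(scores)[max.col(scores, ties.method = "first")]

print(predicted)
```

With realistic data you would hold out part of the coded documents to estimate how well the classifier generalizes before applying it to uncoded text.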
This session will cover classification of texts into unknown categories based on the clustering of text.
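As a simple illustration of grouping texts without labels, the sketch below clusters documents hierarchically based on their word proportions, using quanteda together with base R's `hclust()`. Topic models are another common route; this example swaps in plain clustering to stay small and deterministic. The four documents and the choice of two clusters are assumptions made for illustration.

```r
library(quanteda)

# Hypothetical toy documents without any labels
texts <- c(d1 = "taxes growth economy taxes",
           d2 = "economy taxes business growth",
           d3 = "hospitals nurses health care",
           d4 = "health hospitals doctors nurses")

dfmat <- dfm(tokens(texts))

# Use within-document proportions so that document length does not dominate
X <- as.matrix(dfm_weight(dfmat, scheme = "prop"))

# Hierarchical clustering on Euclidean distances between documents
tree <- hclust(dist(X), method = "average")

# Cut the tree into two clusters (the number of clusters is a modelling choice)
clusters <- cutree(tree, k = 2)

print(clusters)
```

The cluster numbers themselves carry no meaning; interpreting what each group is about remains a task for the researcher.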
After each session, you will receive an exercise sheet with tasks based on that session. The exercise sheets will be accompanied by five lab sessions (dates and times to be determined in the first class). To receive credit, you should attend all classes and labs and complete the exercise sheets.
To follow the course, you will need an R installation (optionally with RStudio).
This is a continuously updated list of packages we will use or cover in the course:
rvest (potentially: httr, RSelenium)
quanteda, caret, stm (potentially: stringr, readtext)
tidyverse (especially dplyr, tidyr, lubridate and ggplot2)
rmarkdown and knitr
You may want to prepare by installing the most important ones:
first_packages <- c("tidyverse","rvest","quanteda","rmarkdown","knitr")
install.packages(first_packages)
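After installing, you may want to check that everything is available. The following is a small sketch of one way to do this; the variable names `installed` and `missing_packages` are chosen here for illustration.

```r
first_packages <- c("tidyverse", "rvest", "quanteda", "rmarkdown", "knitr")

# TRUE for each package that can be loaded, FALSE otherwise
installed <- vapply(first_packages, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Report anything that still needs to be installed
missing_packages <- names(installed)[!installed]
if (length(missing_packages) > 0) {
  message("Still missing: ", paste(missing_packages, collapse = ", "))
}
```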
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA). If not stated otherwise, images are created by the Course Creator.