Course summary

In recent years, the digitization of legal, political, and journalistic corpora, together with the growth of online sources, has allowed researchers to address new questions across the political and social sciences. This has spurred the development of new methods for analyzing text as data.

The aim of this course is to introduce students to the quantitative analysis of textual data. We will cover the theoretical underpinnings of text-as-data approaches, implement these methods in hands-on exercises using the R statistical programming language, and discuss examples of recent empirical research that uses these techniques.

The course covers the collection of text data through webscraping, text preprocessing, dictionaries and descriptive analysis of texts, and supervised and unsupervised learning methods for classifying the content of text corpora.


  • Dr. Theresa Gessler (Instructor)
  • Tobias Widmann (Teaching Assistant)


Except for the first and last class, all classes are held Tuesday 2pm-4pm.

Accompanying Lab sessions are held Friday 2pm-3pm.

Tasks should be uploaded by 12pm on the Friday of the lab session via Dropbox File Request.

18/11 Introduction to Text Analysis

This first, introductory session outlines what the course covers and collects your previous experience with text analysis.

01/12 Webscraping

This session covers how to collect data from webpages. We will focus on basic scraping techniques but also survey the more advanced techniques that are out there.
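As a taste of what the session covers, here is a minimal rvest sketch (rvest >= 1.0 syntax assumed). To keep it self-contained it parses an inline HTML string rather than a live webpage; with a real site you would pass a URL to `read_html()` instead:

```r
library(rvest)

# Parse an HTML document (here an inline string; normally a URL)
page <- read_html('
  <html><body>
    <h1>Press releases</h1>
    <ul>
      <li class="release">Release one</li>
      <li class="release">Release two</li>
    </ul>
  </body></html>')

# Extract elements with CSS selectors and convert them to text
headline <- html_text2(html_element(page, "h1"))
releases <- html_text2(html_elements(page, "li.release"))

headline   # "Press releases"
releases   # "Release one" "Release two"
```

The same pattern, `read_html()` followed by CSS selectors, carries over directly to scraping real pages.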

08/12 Descriptive Analyses, Dictionaries

This session turns to text analysis itself: you will learn how to describe whole collections of texts and how to apply dictionaries to learn more about their content.
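A minimal quanteda sketch of both ideas, using a hypothetical two-category sentiment dictionary made up for this example:

```r
library(quanteda)

texts <- c(d1 = "The economy is doing great, growth is strong",
           d2 = "Unemployment is a terrible crisis")

# Descriptive view: tokenize and build a document-feature matrix
dfmat <- dfm(tokens(texts))
topfeatures(dfmat)   # most frequent features across the corpus

# Dictionary analysis: count matches per category in each document
dict <- dictionary(list(positive = c("great", "strong", "growth"),
                        negative = c("terrible", "crisis")))
dfm_lookup(dfmat, dict)
```

`dfm_lookup()` collapses the feature counts into one column per dictionary category, so each document gets a positive and a negative score.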

15/12 Supervised Learning Methods

This session will cover classification of texts into known categories with previously coded training data.
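The logic can be sketched with a Naive Bayes classifier. This example uses `textmodel_nb()` from quanteda.textmodels (a package not on the course list, chosen here for brevity); the training texts and labels are invented for illustration:

```r
library(quanteda)
library(quanteda.textmodels)   # provides textmodel_nb()

# Tiny hand-coded training set (hypothetical labels)
train_txt <- c("tax cuts and free markets", "lower taxes for business",
               "public healthcare for all", "expand social welfare")
train_lab <- c("right", "right", "left", "left")

dfm_train <- dfm(tokens(train_txt))
nb <- textmodel_nb(dfm_train, y = train_lab)

# Classify an unseen text: it must share the training feature space,
# so we align its features with dfm_match() before predicting
dfm_new <- dfm_match(dfm(tokens("cut taxes now")),
                     features = featnames(dfm_train))
predict(nb, newdata = dfm_new)
```

The key step for any supervised text model is the last one: new documents must be projected into the same feature space the model was trained on.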

11/01 Unsupervised Learning Methods

This session will cover classification of texts into unknown categories based on the clustering of text.
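In the course we will use stm for topic modeling; as a simpler stand-in, the clustering idea can be sketched with k-means on a tf-idf-weighted document-feature matrix (the texts and the choice of k = 2 are assumptions for this toy example):

```r
library(quanteda)
set.seed(1)

txts <- c("budget taxes economy spending",
          "taxes budget fiscal economy",
          "climate emissions environment energy",
          "emissions climate renewable energy")

# Weight the document-feature matrix by tf-idf, then cluster documents
dfmat <- dfm_tfidf(dfm(tokens(txts)))
cl <- kmeans(as.matrix(dfmat), centers = 2, nstart = 10)
cl$cluster   # cluster assignment per document
```

Unlike the supervised case, no labels are provided: the two groups (economic vs. environmental texts) emerge purely from word co-occurrence.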


After each session, you will receive an exercise sheet with tasks based on that session. These will be accompanied by five lab sessions (dates and times to be determined in the first class). For credit, all classes and labs must be attended and all exercise sheets completed.

To follow the course, you will need an R installation (ideally with RStudio).

This is a continuously updated list of packages we will use or cover in the course:

  • scraping: rvest (potentially: httr, RSelenium)
  • text analysis: quanteda, caret, stm (potentially: stringr, readtext)
  • data wrangling and visualization: tidyverse (especially dplyr, tidyr, lubridate and ggplot2)
  • creating documents and reports: rmarkdown and knitr

You may want to prepare by installing the most important ones:

first_packages <- c("tidyverse", "rvest", "quanteda", "rmarkdown", "knitr")
install.packages(first_packages)

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA). If not stated otherwise, images are created by the Course Creator.