Course summary

In recent years, the digitalization of legal, political, and journalistic corpora, as well as the growth of online sources, has allowed researchers to address new questions across political and social science disciplines. This has spurred the development of new methods to analyze text as data.

The aim of this course is to introduce students to the quantitative analysis of textual data. We will cover the theoretical underpinnings of text-as-data approaches, implement these methods in hands-on exercises using the R statistical programming language, and discuss examples of recent empirical research that uses these techniques.

The course will cover the collection of text data with webscraping techniques, text preprocessing, dictionaries and descriptive analysis of texts, as well as supervised and unsupervised learning methods to classify the content of text corpora.


  • Dr. Theresa Gessler (Instructor)
  • Mirko Wegemann (Lab instructor / Teaching Assistant 2022)
  • Tobias Widmann (Former Lab instructor / Teaching Assistant 2021)


  • 10/05 9-13h, lab 14-17h
  • 11/05 9-13h, lab 14-17h
  • 12/05 9-13h, lab 14-17h

Tuesday 10/05 Introduction to Text Analysis, Descriptive Analyses & Dictionaries

This is a first introductory session to inform students what the course covers and discuss the fundamentals of using text as data. The second part dives into doing text analysis and teaches students how to describe whole collections of text as well as how to apply dictionaries to learn more about texts.
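As a taste of what the dictionary session covers, the snippet below sketches how quanteda can count dictionary matches in a small corpus. The texts, category names, and glob patterns are made-up illustrations, not course material:

```r
library(quanteda)

# two toy documents (hypothetical examples)
toks <- tokens(c(d1 = "The economy is growing fast",
                 d2 = "New immigration policy announced"))

# a tiny made-up dictionary; "*" is a glob wildcard
dict <- dictionary(list(economy   = c("econom*", "growth"),
                        migration = c("immigr*", "asylum")))

# count dictionary hits per document
counts <- dfm_lookup(dfm(toks), dict)
counts
```

In practice the course will work with established dictionaries and real corpora; this only illustrates the mechanics of `dictionary()` and `dfm_lookup()`.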

Wednesday 11/05 Supervised and Unsupervised Learning Methods

The first session will cover the classification of texts into known categories using previously coded training data. The second session covers the classification of texts into unknown categories based on the clustering of texts.
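To preview the supervised part of the session, here is a minimal sketch of training a Naive Bayes classifier with quanteda.textmodels on a hand-labelled toy corpus. The texts and labels are invented for illustration only:

```r
library(quanteda)
library(quanteda.textmodels)

# tiny made-up training corpus with hand-coded labels
train_txt <- c("cut taxes and reduce spending",
               "lower taxes for businesses",
               "expand welfare and social spending",
               "invest in public welfare programs")
train_y <- factor(c("right", "right", "left", "left"))

# document-feature matrix and Naive Bayes model
dfmat_train <- dfm(tokens(train_txt))
nb <- textmodel_nb(dfmat_train, y = train_y)

# classify a new, unseen text; features must be matched to the training dfm
dfmat_test <- dfm(tokens("we should cut taxes")) |>
  dfm_match(features = featnames(dfmat_train))
pred <- predict(nb, newdata = dfmat_test)
```

Real applications require far larger training sets and careful validation, which is exactly what the session will discuss.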

Thursday 12/05 Webscraping & the Text Analysis Pipeline

This session covers how to collect data from webpages. We will discuss basic scraping techniques and survey more advanced approaches. Finally, we will collect data to put some of the methods discussed in the previous days to work.
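The basic scraping workflow with rvest can be sketched as below. To keep the example self-contained, it parses a small made-up HTML snippet rather than a live page; for a real site you would pass a URL to read_html() instead:

```r
library(rvest)

# a hypothetical HTML fragment standing in for a downloaded page
html <- '<html><body>
  <h1>Press releases</h1>
  <p class="teaser">First statement</p>
  <p class="teaser">Second statement</p>
</body></html>'

page <- read_html(html)

# extract elements with CSS selectors
title   <- page |> html_element("h1") |> html_text2()
teasers <- page |> html_elements("p.teaser") |> html_text2()
```

The same pattern — read the page, select elements, extract text — carries over directly to live pages with `read_html("https://...")`.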


After each session, you will receive an exercise sheet with tasks based on that session. These will be accompanied by five lab sessions (dates and times to be determined in the first class). For credit, attendance at all classes and labs and completion of the exercise sheets are required.

To follow the course, you will need a working R installation (ideally with RStudio).

This is a continuously updated list of packages we will use or cover in the course:

  • scraping: rvest (potentially: httr, RSelenium)
  • text analysis: quanteda, quanteda.textstats, quanteda.textmodels, quanteda.textplots, stm (potentially: stringr, readtext, caret)
  • data wrangling and visualization: tidyverse (especially dplyr, tidyr, lubridate and ggplot2)
  • creating documents and reports: rmarkdown and knitr

You may want to prepare by installing the most important ones:

first_packages <- c("tidyverse", "rvest", "quanteda", "quanteda.textstats",
                    "quanteda.textmodels", "quanteda.textplots", "stm")
install.packages(first_packages)


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA). If not stated otherwise, images are created by the Course Creator.