This is not a list of required readings for the course. Instead, the resources are meant to give you a starting point for getting into text analysis and scraping and for finding interesting applications in your field. If you think something is missing, ping me and I will add it. The list is by no means complete, is updated only sporadically, and necessarily misses a lot of great work.

Basics

Text Books and Cheat Sheets

Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, New Jersey; Oxford: Princeton University Press, 2022.
Atteveldt, Wouter van, Damian Trilling, and Carlos Arcíla. Computational Analysis of Communication: A Practical Introduction to the Analysis of Texts, Networks, and Images with Code Examples in Python and R. Hoboken, NJ: John Wiley & Sons, 2021.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, eds. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics 103. New York: Springer, 2013.
Daniel Jurafsky, James Martin: Speech and Language Processing
Christopher Manning, Prabhakar Raghavan, Hinrich Schütze: An Introduction to Information Retrieval
Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, West Sussex, United Kingdom: John Wiley & Sons Inc, 2015.
Introduction to HTML
Kohei Watanabe, Stefan Müller: Quanteda Tutorials
Summer Institute in Computational Social Science Material
Course materials by Chris Bail

Overviews of the Field

Atteveldt, Wouter van, and Tai-Quan Peng. “When Communication Meets Computation: Opportunities, Challenges, and Pitfalls in Computational Communication Science.” Communication Methods and Measures 12, no. 2–3 (April 3, 2018): 81–92. https://doi.org/10.1080/19312458.2018.1458084.
Fréchet, Nadjim, Justin Savoie, and Yannick Dufresne. “Analysis of Text-Analysis Syllabi: Building a Text-Analysis Syllabus Using Scaling.” PS: Political Science & Politics, 1–6. https://doi.org/10.1017/S1049096519001732.
Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. “Text as Data.” Journal of Economic Literature 57, no. 3 (September 1, 2019): 535–74. https://doi.org/10.1257/jel.20181020.
Grimmer, J., and B. M. Stewart. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21, no. 3 (July 1, 2013): 267–97. https://doi.org/10.1093/pan/mps028.
Schoonvelde, Martijn, Gijs Schumacher, and Bert N. Bakker. “Friends With Text as Data Benefits: Assessing and Extending the Use of Automated Text Analysis in Political Science and Political Psychology.” Journal of Social and Political Psychology 7, no. 1 (February 8, 2019): 124–143. https://doi.org/10.5964/jspp.v7i1.964.

General R and Programming Introductions

RStudio Cheat Sheets
SoftwareCarpentry
R for Reproducible Research
tidyverse style guide
Code and Data for the Social Sciences: A Practitioner’s Guide
Garrett Grolemund: Hands-On Programming with R
Hadley Wickham, Garrett Grolemund: R for Data Science
Hadley Wickham: Advanced R
Computational Tools for Social Science Course Materials by Rochelle Terman
Computing for the Social Sciences Course Materials by Benjamin Soltoff

Packages

Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3, no. 30 (October 6, 2018): 774. https://doi.org/10.21105/joss.00774.
Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. “Stm: R Package for Structural Topic Models.” Journal of Statistical Software, 2013.

Key Methods

Barberá, Pablo, Amber E. Boydstun, Suzanna Linn, Ryan McMahon, and Jonathan Nagler. “Automated Text Classification of News Articles: A Practical Guide.” Political Analysis, 1–24. https://doi.org/10.1017/pan.2020.8.
Cranmer, Skyler J. “Introduction to the Virtual Issue: Machine Learning in Political Science,” n.d., 9.
Denny, Matthew J., and Arthur Spirling. “Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It.” Political Analysis 26, no. 2 (April 2018): 168–89. https://doi.org/10.1017/pan.2017.44.
Grimmer, Justin, and Gary King. “General Purpose Computer-Assisted Clustering and Conceptualization.” Proceedings of the National Academy of Sciences 108, no. 7 (February 15, 2011): 2643–50. https://doi.org/10.1073/pnas.1018067108.
Monroe, B. L., and P. A. Schrodt. “Introduction to the Special Issue: The Statistical Analysis of Political Text.” Political Analysis 16, no. 4 (October 4, 2008): 351–55. https://doi.org/10.1093/pan/mpn017.
Muddiman, Ashley, Shannon C. McGregor, and Natalie Jomini Stroud. “(Re)Claiming Our Expertise: Parsing Large Text Corpora With Manually Validated and Organic Dictionaries.” Political Communication 0, no. 0 (November 7, 2018): 1–13. https://doi.org/10.1080/10584609.2018.1517843.
Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. “Structural Topic Models for Open-Ended Survey Responses.” American Journal of Political Science 58, no. 4 (October 1, 2014): 1064–82. https://doi.org/10.1111/ajps.12103.
Rodman, Emma. “A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors.” Political Analysis 28, no. 1 (2020). https://doi.org/10.1017/pan.2019.23.
Slapin, Jonathan B., and Sven-Oliver Proksch. “A Scaling Model for Estimating Time-Series Party Positions from Texts.” American Journal of Political Science 52, no. 3 (July 1, 2008): 705–22. https://doi.org/10.1111/j.1540-5907.2008.00338.x.

Interesting Applications

Applications mentioned in class and other interesting applications. This list is by no means complete and misses a lot of relevant and great research, but I hope it provides a useful starting point!

Anastasopoulos, L. Jason, and Anthony M. Bertelli. “Understanding Delegation Through Machine Learning: A Method and Application to the European Union.” American Political Science Review, 1–11. https://doi.org/10.1017/S0003055419000522.
Bauer, Paul C., Pablo Barberá, Kathrin Ackermann, and Aaron Venetz. “Is the Left-Right Scale a Valid Measure of Ideology?” Political Behavior 39, no. 3 (2017): 553–83.
Beltran, Javier, Aina Gallego, Alba Huidobro, Enrique Romero, and Lluís Padró. “Male and Female Politicians on Twitter: A Machine Learning Approach.” European Journal of Political Research n/a, no. n/a. Accessed March 24, 2020. https://doi.org/10.1111/1475-6765.12392.
Benoit, Kenneth, Kevin Munger, and Arthur Spirling. “Measuring and Explaining Political Sophistication through Textual Complexity.” American Journal of Political Science 63, no. 2 (2019): 491–508. https://doi.org/10.1111/ajps.12423.
Burscher, Bjorn, Rens Vliegenthart, and Claes H. De Vreese. “Using Supervised Machine Learning to Code Policy Issues: Can Classifiers Generalize across Contexts?” The ANNALS of the American Academy of Political and Social Science 659, no. 1 (May 1, 2015): 122–31. https://doi.org/10.1177/0002716215569441.
DiMaggio, Paul, Manish Nag, and David Blei. “Exploiting Affinities between Topic Modeling and the Sociological Perspective on Culture: Application to Newspaper Coverage of U.S. Government Arts Funding.” Poetics, Topic Models and the Cultural Sciences, 41, no. 6 (December 2013): 570–606. https://doi.org/10.1016/j.poetic.2013.08.004.
Egami, Naoki, Christian J Fong, Justin Grimmer, Margaret E Roberts, and Brandon M Stewart. “How to Make Causal Inferences Using Texts,” n.d., 68.
Gilardi, Fabrizio, Theresa Gessler, Mael Kubli, and Stefan Müller. “Social Media and Political Agenda Setting.” Work in Progress, 2020.
Grimmer, J. “A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases.” Political Analysis 18, no. 1 (January 1, 2010): 1–35. https://doi.org/10.1093/pan/mpp034.
Hobbs, William R., and Margaret E. Roberts. “How Sudden Censorship Can Increase Access to Information.” American Political Science Review 112, no. 3 (August 2018): 621–36. https://doi.org/10.1017/S0003055418000084.
King, Gary, Jennifer Pan, and Margaret E. Roberts. “How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, Not Engaged Argument.” American Political Science Review 111, no. 3 (August 2017): 484–501. https://doi.org/10.1017/S0003055417000144.
Loughran, Tim, and Bill McDonald. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66, no. 1 (2011): 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x.
Peterson, Andrew, and Arthur Spirling. “Classification Accuracy as a Substantive Quantity of Interest: Measuring Polarization in Westminster Systems.” Political Analysis 26, no. 1 (January 2018): 120–28. https://doi.org/10.1017/pan.2017.39.
Proksch, Sven-Oliver, Will Lowe, Jens Wäckerle, and Stuart Soroka. “Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches.” Legislative Studies Quarterly 0, no. 0. Accessed January 20, 2019. https://doi.org/10.1111/lsq.12218.
Proksch, Sven-Oliver, and Jonathan B. Slapin. “Parliamentary Questions and Oversight in the European Union.” European Journal of Political Research 50, no. 1 (January 1, 2011): 53–79. https://doi.org/10.1111/j.1475-6765.2010.01919.x.
Rossiter, Erin. “Measuring Agenda Setting in Interactive Political Communications.” Working paper.
Schmidt, Benjamin M. “Words Alone: Dismantling Topic Models in the Humanities.” Journal of Digital Humanities, April 5, 2013. http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/.
Schoonvelde, Martijn, Anna Brosius, Gijs Schumacher, and Bert N. Bakker. “Liberals Lecture, Conservatives Communicate: Analyzing Complexity and Ideology in 381,609 Political Speeches.” PLOS ONE 14, no. 2 (February 6, 2019). https://doi.org/10.1371/journal.pone.0208450.
Schwemmer, Carsten, and Oliver Wieczorek. “The Methodological Divide of Sociology: Evidence from Two Decades of Journal Publications.” Sociology 54, no. 1 (2020): 3–21.
Shugars, Sarah. “The Structure of Reasoning: Measuring Justification and Preferences in Text.” Working paper, 26.
Shugars, Sarah, and Nicholas Beauchamp. “Why Keep Arguing? Predicting Engagement in Political Conversations Online.” SAGE Open 9, no. 1 (January 1, 2019): 2158244019828850. https://doi.org/10.1177/2158244019828850.
Spirling, Arthur. “Democratization and Linguistic Complexity: The Effect of Franchise Extension on Parliamentary Discourse, 1832–1915.” The Journal of Politics 78, no. 1 (December 17, 2015): 120–36. https://doi.org/10.1086/683612.
Spirling, Arthur. “U.S. Treaty Making with American Indians: Institutional Change and Relative Power, 1784–1911.” American Journal of Political Science 56, no. 1 (January 1, 2012): 84–97. https://doi.org/10.1111/j.1540-5907.2011.00558.x.
Terman, Rochelle. “Islamophobia and Media Portrayals of Muslim Women: A Computational Text Analysis of US News Coverage.” International Studies Quarterly 61, no. 3 (September 1, 2017): 489–502. https://doi.org/10.1093/isq/sqx051.
Watanabe, Kohei, and Yuan Zhou. “Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches.” Social Science Computer Review, February 21, 2020, 0894439320907027. https://doi.org/10.1177/0894439320907027.
Wiedemann, Gregor. “Proportional Classification Revisited: Automatic Content Analysis of Political Manifestos Using Active Learning.” Social Science Computer Review, February 25, 2018, 0894439318758389. https://doi.org/10.1177/0894439318758389.

More advanced text analysis techniques

Word embeddings

Rodman, Emma. “A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors.” Political Analysis 28, no. 1 (2020). https://doi.org/10.1017/pan.2019.23.
Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115, no. 16 (2018). https://doi.org/10.1073/pnas.1720347115.
Kozlowski, Austin C., Matt Taddy, and James A. Evans. “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84, no. 5 (October 1, 2019): 905–49. https://doi.org/10.1177/0003122419877135.
Rheault, Ludovic, and Christopher Cochrane. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” Political Analysis 28, no. 1 (January 2020): 112–33. https://doi.org/10.1017/pan.2019.26.
Rodríguez, Pedro L., Arthur Spirling, and Brandon M Stewart. “Embedding Regression: Models for Context-Specific Description and Inference in Political Science,” July 1, 2021. https://github.com/prodriguezsosa/EmbeddingRegression.
Spirling, Arthur, and Pedro Rodriguez. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” Journal of Politics, forthcoming, 2021.

Scraping Ethics

Bruns, Axel. “After the ‘APIcalypse’: Social Media Platforms and Their Fight against Critical Scholarly Research.” Information, Communication & Society 0, no. 0 (July 11, 2019): 1–23. https://doi.org/10.1080/1369118X.2019.1637447.
Freelon, Deen. “Computational Research in the Post-API Age.” Political Communication 35, no. 4 (October 2, 2018): 665–68. https://doi.org/10.1080/10584609.2018.1477506.
Halavais, Alexander. “Overcoming Terms of Service: A Proposal for Ethical Distributed Research.” Information, Communication & Society 22, no. 11 (September 19, 2019): 1567–81. https://doi.org/10.1080/1369118X.2019.1627386.
King, Gary, and Nathaniel Persily. “A New Model for Industry–Academic Partnerships.” PS: Political Science & Politics 53, no. 4 (October 2020): 703–9. https://doi.org/10.1017/S1049096519001021.
Puschmann, Cornelius. “An End to the Wild West of Social Media Research: A Response to Axel Bruns.” Information, Communication & Society 22, no. 11 (September 19, 2019): 1582–89. https://doi.org/10.1080/1369118X.2019.1646300.

Beyond Text

Proksch, Sven-Oliver, Christopher Wratil, and Jens Wäckerle. “Testing the Validity of Automatic Speech Recognition for Political Text Analysis.” Political Analysis 27, no. 3 (July 2019): 339–59. https://doi.org/10.1017/pan.2018.62.
Webb Williams, Nora, Andreu Casas, and John D. Wilkerson. Images as Data for Social Science Research: An Introduction to Convolutional Neural Nets for Image Classification. 1st ed. Cambridge University Press, 2020. https://doi.org/10.1017/9781108860741.

Books on other Approaches to Text Analysis

Maybe you found that you do like text analysis but R and/or quanteda are not for you. Here are some recommendations based on different packages or programming languages:

Python: Dirk Hovy: Text Analysis in Python for Social Scientists. Discovery and Exploration
R with tidytext: Julia Silge, David Robinson: Text Mining with R

Readings

last updated: 2022-05-08