This is not a list of required readings for the course. Instead, the resources are meant to give you a starting point for getting into text analysis and scraping and for finding interesting applications in your field. If you think something is missing, ping me and I will add it. The list is by no means complete, is updated only sporadically, and necessarily misses a lot of great work.

Basics

Text Books and Cheat Sheets

  • Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, New Jersey; Oxford: Princeton University Press, 2022.
  • Atteveldt, Wouter van, Damian Trilling, and Carlos Arcíla. Computational Analysis of Communication: A Practical Introduction to the Analysis of Texts, Networks, and Images with Code Examples in Python and R. Hoboken, NJ: John Wiley & Sons, 2021.
  • James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, eds. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics 103. New York: Springer, 2013.
  • Daniel Jurafsky, James Martin: Speech and Language Processing
  • Christopher Manning, Prabhakar Raghavan, Hinrich Schütze: An Introduction to Information Retrieval
  • Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, West Sussex, United Kingdom: John Wiley & Sons Inc, 2015.
  • Introduction to HTML
  • Kohei Watanabe, Stefan Müller: Quanteda Tutorials
  • Summer Institute in Computational Social Science Material
  • Course materials by Chris Bail

Overviews of the Field

  • Atteveldt, Wouter van, and Tai-Quan Peng. “When Communication Meets Computation: Opportunities, Challenges, and Pitfalls in Computational Communication Science.” Communication Methods and Measures 12, no. 2–3 (April 3, 2018): 81–92. https://doi.org/10.1080/19312458.2018.1458084.
  • Fréchet, Nadjim, Justin Savoie, and Yannick Dufresne. “Analysis of Text-Analysis Syllabi: Building a Text-Analysis Syllabus Using Scaling.” PS: Political Science & Politics, 1–6. https://doi.org/10.1017/S1049096519001732.
  • Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. “Text as Data.” Journal of Economic Literature 57, no. 3 (September 1, 2019): 535–74. https://doi.org/10.1257/jel.20181020.
  • Grimmer, J., and B. M. Stewart. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21, no. 3 (July 1, 2013): 267–97. https://doi.org/10.1093/pan/mps028.
  • Schoonvelde, Martijn, Gijs Schumacher, and Bert N. Bakker. “Friends With Text as Data Benefits: Assessing and Extending the Use of Automated Text Analysis in Political Science and Political Psychology.” Journal of Social and Political Psychology 7, no. 1 (February 8, 2019): 124–143. https://doi.org/10.5964/jspp.v7i1.964.

Packages

  • Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3, no. 30 (October 6, 2018): 774. https://doi.org/10.21105/joss.00774.
  • Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. “Stm: R Package for Structural Topic Models.” Journal of Statistical Software, 2013.

Key Methods

  • Barberá, Pablo, Amber E. Boydstun, Suzanna Linn, Ryan McMahon, and Jonathan Nagler. “Automated Text Classification of News Articles: A Practical Guide.” Political Analysis, 1–24. https://doi.org/10.1017/pan.2020.8.
  • Cranmer, Skyler J. “Introduction to the Virtual Issue: Machine Learning in Political Science,” n.d., 9.
  • Denny, Matthew J., and Arthur Spirling. “Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It.” Political Analysis 26, no. 2 (April 2018): 168–89. https://doi.org/10.1017/pan.2017.44.
  • Grimmer, Justin, and Gary King. “General Purpose Computer-Assisted Clustering and Conceptualization.” Proceedings of the National Academy of Sciences 108, no. 7 (February 15, 2011): 2643–50. https://doi.org/10.1073/pnas.1018067108.
  • Monroe, B. L., and P. A. Schrodt. “Introduction to the Special Issue: The Statistical Analysis of Political Text.” Political Analysis 16, no. 4 (October 4, 2008): 351–55. https://doi.org/10.1093/pan/mpn017.
  • Muddiman, Ashley, Shannon C. McGregor, and Natalie Jomini Stroud. “(Re)Claiming Our Expertise: Parsing Large Text Corpora With Manually Validated and Organic Dictionaries.” Political Communication (November 7, 2018): 1–13. https://doi.org/10.1080/10584609.2018.1517843.
  • Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. “Structural Topic Models for Open-Ended Survey Responses.” American Journal of Political Science 58, no. 4 (October 1, 2014): 1064–82. https://doi.org/10.1111/ajps.12103.
  • Rodman, Emma. “A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors.” Political Analysis, 1–25. https://doi.org/10.1017/pan.2019.23.
  • Slapin, Jonathan B., and Sven-Oliver Proksch. “A Scaling Model for Estimating Time-Series Party Positions from Texts.” American Journal of Political Science 52, no. 3 (July 1, 2008): 705–22. https://doi.org/10.1111/j.1540-5907.2008.00338.x.

Interesting Applications

Applications mentioned in class, plus other interesting ones. This list is by no means complete and misses a lot of relevant and great research, but I hope it provides a useful starting point!

  • Anastasopoulos, L. Jason, and Anthony M. Bertelli. “Understanding Delegation Through Machine Learning: A Method and Application to the European Union.” American Political Science Review, 1–11. https://doi.org/10.1017/S0003055419000522.
  • Bauer, Paul C., Pablo Barberá, Kathrin Ackermann, and Aaron Venetz. “Is the Left-Right Scale a Valid Measure of Ideology?” Political Behavior 39, no. 3 (2017): 553–83.
  • Beltran, Javier, Aina Gallego, Alba Huidobro, Enrique Romero, and Lluís Padró. “Male and Female Politicians on Twitter: A Machine Learning Approach.” European Journal of Political Research. Accessed March 24, 2020. https://doi.org/10.1111/1475-6765.12392.
  • Benoit, Kenneth, Kevin Munger, and Arthur Spirling. “Measuring and Explaining Political Sophistication through Textual Complexity.” American Journal of Political Science 63, no. 2 (2019): 491–508. https://doi.org/10.1111/ajps.12423.
  • Burscher, Bjorn, Rens Vliegenthart, and Claes H. De Vreese. “Using Supervised Machine Learning to Code Policy Issues: Can Classifiers Generalize across Contexts?” The ANNALS of the American Academy of Political and Social Science 659, no. 1 (May 1, 2015): 122–31. https://doi.org/10.1177/0002716215569441.
  • DiMaggio, Paul, Manish Nag, and David Blei. “Exploiting Affinities between Topic Modeling and the Sociological Perspective on Culture: Application to Newspaper Coverage of U.S. Government Arts Funding.” Poetics, Topic Models and the Cultural Sciences, 41, no. 6 (December 2013): 570–606. https://doi.org/10.1016/j.poetic.2013.08.004.
  • Egami, Naoki, Christian J Fong, Justin Grimmer, Margaret E Roberts, and Brandon M Stewart. “How to Make Causal Inferences Using Texts,” n.d., 68.
  • Gilardi, Fabrizio, Theresa Gessler, Mael Kubli, and Stefan Müller. “Social Media and Political Agenda Setting.” Work in Progress, 2020.
  • Grimmer, J. “A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases.” Political Analysis 18, no. 1 (January 1, 2010): 1–35. https://doi.org/10.1093/pan/mpp034.
  • Hobbs, William R., and Margaret E. Roberts. “How Sudden Censorship Can Increase Access to Information.” American Political Science Review 112, no. 3 (August 2018): 621–36. https://doi.org/10.1017/S0003055418000084.
  • King, Gary, Jennifer Pan, and Margaret E. Roberts. “How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, Not Engaged Argument.” American Political Science Review 111, no. 3 (August 2017): 484–501. https://doi.org/10.1017/S0003055417000144.
  • Loughran, Tim, and Bill McDonald. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66, no. 1 (2011): 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x.
  • Peterson, Andrew, and Arthur Spirling. “Classification Accuracy as a Substantive Quantity of Interest: Measuring Polarization in Westminster Systems.” Political Analysis 26, no. 1 (January 2018): 120–28. https://doi.org/10.1017/pan.2017.39.
  • Proksch, Sven-Oliver, Will Lowe, Jens Wäckerle, and Stuart Soroka. “Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches.” Legislative Studies Quarterly. Accessed January 20, 2019. https://doi.org/10.1111/lsq.12218.
  • Proksch, Sven-Oliver, and Jonathan B. Slapin. “Parliamentary Questions and Oversight in the European Union.” European Journal of Political Research 50, no. 1 (January 1, 2011): 53–79. https://doi.org/10.1111/j.1475-6765.2010.01919.x.
  • Rossiter, Erin. “Measuring Agenda Setting in Interactive Political Communications.” Working paper.
  • Schmidt, Benjamin M. “Words Alone: Dismantling Topic Models in the Humanities.” Journal of Digital Humanities, April 5, 2013. http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/.
  • Schoonvelde, Martijn, Anna Brosius, Gijs Schumacher, and Bert N. Bakker. “Liberals Lecture, Conservatives Communicate: Analyzing Complexity and Ideology in 381,609 Political Speeches.” PLOS ONE 14, no. 2 (February 6, 2019). https://doi.org/10.1371/journal.pone.0208450.
  • Schwemmer, Carsten, and Oliver Wieczorek. “The Methodological Divide of Sociology: Evidence from Two Decades of Journal Publications.” Sociology 54, no. 1 (2020): 3–21.
  • Shugars, Sarah. “The Structure of Reasoning: Measuring Justification and Preferences in Text.” Working paper, 26.
  • Shugars, Sarah, and Nicholas Beauchamp. “Why Keep Arguing? Predicting Engagement in Political Conversations Online.” SAGE Open 9, no. 1 (January 1, 2019): 2158244019828850. https://doi.org/10.1177/2158244019828850.
  • Spirling, Arthur. “Democratization and Linguistic Complexity: The Effect of Franchise Extension on Parliamentary Discourse, 1832–1915.” The Journal of Politics 78, no. 1 (December 17, 2015): 120–36. https://doi.org/10.1086/683612.
  • Spirling, Arthur. “U.S. Treaty Making with American Indians: Institutional Change and Relative Power, 1784–1911.” American Journal of Political Science 56, no. 1 (January 1, 2012): 84–97. https://doi.org/10.1111/j.1540-5907.2011.00558.x.
  • Terman, Rochelle. “Islamophobia and Media Portrayals of Muslim Women: A Computational Text Analysis of US News Coverage.” International Studies Quarterly 61, no. 3 (September 1, 2017): 489–502. https://doi.org/10.1093/isq/sqx051.
  • Watanabe, Kohei, and Yuan Zhou. “Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches.” Social Science Computer Review, February 21, 2020, 0894439320907027. https://doi.org/10.1177/0894439320907027.
  • Wiedemann, Gregor. “Proportional Classification Revisited: Automatic Content Analysis of Political Manifestos Using Active Learning.” Social Science Computer Review, February 25, 2018, 0894439318758389. https://doi.org/10.1177/0894439318758389.

More Advanced Text Analysis Techniques

Word Embeddings

  • Rodman, Emma. “A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors.” Political Analysis 28, no. 1 (2020). https://doi.org/10.1017/pan.2019.23.
  • Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115, no. 16 (2018). https://doi.org/10.1073/pnas.1720347115.
  • Kozlowski, Austin C., Matt Taddy, and James A. Evans. “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84, no. 5 (October 1, 2019): 905–49. https://doi.org/10.1177/0003122419877135.
  • Rheault, Ludovic, and Christopher Cochrane. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” Political Analysis 28, no. 1 (January 2020): 112–33. https://doi.org/10.1017/pan.2019.26.
  • Rodríguez, Pedro L., Arthur Spirling, and Brandon M Stewart. “Embedding Regression: Models for Context-Specific Description and Inference in Political Science,” July 1, 2021. https://github.com/prodriguezsosa/EmbeddingRegression.
  • Spirling, Arthur, and Pedro Rodriguez. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” Journal of Politics, forthcoming, 2021.

Scraping Ethics

Beyond Text

  • Proksch, Sven-Oliver, Christopher Wratil, and Jens Wäckerle. “Testing the Validity of Automatic Speech Recognition for Political Text Analysis.” Political Analysis 27, no. 3 (July 2019): 339–59. https://doi.org/10.1017/pan.2018.62.
  • Webb Williams, Nora, Andreu Casas, and John D. Wilkerson. Images as Data for Social Science Research: An Introduction to Convolutional Neural Nets for Image Classification. 1st ed. Cambridge University Press, 2020. https://doi.org/10.1017/9781108860741.

Books on Other Approaches to Text Analysis

Maybe you found that you do like text analysis, but R and/or quanteda are not for you. Here are some recommendations based on different packages or programming languages: