This is not a list of required readings for the course. Instead, the resources are meant to give you a starting point for getting into text analysis and scraping and for finding interesting applications in your field. If you think something is missing, ping me and I will add it. The list is by no means complete, is updated only sporadically, and necessarily misses a lot of great work.
Text Books and Cheat Sheets
- Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton, New Jersey; Oxford: Princeton University Press, 2022.
- Atteveldt, Wouter van, Damian Trilling, and Carlos Arcíla. Computational analysis of communication: a practical introduction to the analysis of texts, networks, and images with code examples in Python and R. Hoboken, NJ: John Wiley & Sons, 2021.
- James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, eds. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics 103. New York: Springer, 2013.
- Daniel Jurafsky, James Martin: Speech and Language Processing
- Christopher Manning, Prabhakar Raghavan, Hinrich Schütze: An Introduction to Information Retrieval
- Munzert, Simon, Christian Rubba, Peter Meißner, and Dominic Nyhuis. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining. Chichester, West Sussex, United Kingdom: John Wiley & Sons Inc, 2015.
- Introduction to HTML
- Kohei Watanabe, Stefan Müller: Quanteda Tutorials
- Summer Institute in Computational Social Science Material
- Course materials by Chris Bail
Overviews of the Field
- Atteveldt, Wouter van, and Tai-Quan Peng. “When Communication Meets Computation: Opportunities, Challenges, and Pitfalls in Computational Communication Science.” Communication Methods and Measures 12, no. 2–3 (April 3, 2018): 81–92. https://doi.org/10.1080/19312458.2018.1458084.
- Fréchet, Nadjim, Justin Savoie, and Yannick Dufresne. “Analysis of Text-Analysis Syllabi: Building a Text-Analysis Syllabus Using Scaling.” PS: Political Science & Politics, 1–6. https://doi.org/10.1017/S1049096519001732.
- Gentzkow, Matthew, Bryan Kelly, and Matt Taddy. “Text as Data.” Journal of Economic Literature 57, no. 3 (September 1, 2019): 535–74. https://doi.org/10.1257/jel.20181020.
- Grimmer, J., and B. M. Stewart. “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts.” Political Analysis 21, no. 3 (July 1, 2013): 267–97. https://doi.org/10.1093/pan/mps028.
- Schoonvelde, Martijn, Gijs Schumacher, and Bert N. Bakker. “Friends With Text as Data Benefits: Assessing and Extending the Use of Automated Text Analysis in Political Science and Political Psychology.” Journal of Social and Political Psychology 7, no. 1 (February 8, 2019): 124–143. https://doi.org/10.5964/jspp.v7i1.964.
General R and Programming Introductions
- Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. “Quanteda: An R Package for the Quantitative Analysis of Textual Data.” Journal of Open Source Software 3, no. 30 (October 6, 2018): 774. https://doi.org/10.21105/joss.00774.
- Roberts, Margaret E., Brandon M. Stewart, and Dustin Tingley. “Stm: R Package for Structural Topic Models.” Journal of Statistical Software, 2013.
- Barberá, Pablo, Amber E. Boydstun, Suzanna Linn, Ryan McMahon, and Jonathan Nagler. “Automated Text Classification of News Articles: A Practical Guide.” Political Analysis, 1–24. https://doi.org/10.1017/pan.2020.8.
- Cranmer, Skyler J. “Introduction to the Virtual Issue: Machine Learning in Political Science,” n.d., 9.
- Denny, Matthew J., and Arthur Spirling. “Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It.” Political Analysis 26, no. 2 (April 2018): 168–89. https://doi.org/10.1017/pan.2017.44.
- Grimmer, Justin, and Gary King. “General Purpose Computer-Assisted Clustering and Conceptualization.” Proceedings of the National Academy of Sciences 108, no. 7 (February 15, 2011): 2643–50. https://doi.org/10.1073/pnas.1018067108.
- Monroe, B. L., and P. A. Schrodt. “Introduction to the Special Issue: The Statistical Analysis of Political Text.” Political Analysis 16, no. 4 (October 4, 2008): 351–55. https://doi.org/10.1093/pan/mpn017.
- Muddiman, Ashley, Shannon C. McGregor, and Natalie Jomini Stroud. “(Re)Claiming Our Expertise: Parsing Large Text Corpora With Manually Validated and Organic Dictionaries.” Political Communication (November 7, 2018): 1–13. https://doi.org/10.1080/10584609.2018.1517843.
- Roberts, Margaret E., Brandon M. Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G. Rand. “Structural Topic Models for Open-Ended Survey Responses.” American Journal of Political Science 58, no. 4 (October 1, 2014): 1064–82. https://doi.org/10.1111/ajps.12103.
- Rodman, Emma. “A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors.” Political Analysis 28, no. 1 (2020). https://doi.org/10.1017/pan.2019.23.
- Slapin, Jonathan B., and Sven-Oliver Proksch. “A Scaling Model for Estimating Time-Series Party Positions from Texts.” American Journal of Political Science 52, no. 3 (July 1, 2008): 705–22. https://doi.org/10.1111/j.1540-5907.2008.00338.x.
Applications mentioned in class and other interesting applications - this list is by no means complete and misses a lot of relevant and great research but I hope it provides a useful starting point!
- Anastasopoulos, L. Jason, and Anthony M. Bertelli. “Understanding Delegation Through Machine Learning: A Method and Application to the European Union.” American Political Science Review, 1–11. https://doi.org/10.1017/S0003055419000522.
- Bauer, Paul C., Pablo Barberá, Kathrin Ackermann, and Aaron Venetz. “Is the Left-Right Scale a Valid Measure of Ideology?” Political Behavior 39, no. 3 (2017): 553–83.
- Beltran, Javier, Aina Gallego, Alba Huidobro, Enrique Romero, and Lluís Padró. “Male and Female Politicians on Twitter: A Machine Learning Approach.” European Journal of Political Research. Accessed March 24, 2020. https://doi.org/10.1111/1475-6765.12392.
- Benoit, Kenneth, Kevin Munger, and Arthur Spirling. “Measuring and Explaining Political Sophistication through Textual Complexity.” American Journal of Political Science 63, no. 2 (2019): 491–508. https://doi.org/10.1111/ajps.12423.
- Burscher, Bjorn, Rens Vliegenthart, and Claes H. De Vreese. “Using Supervised Machine Learning to Code Policy Issues: Can Classifiers Generalize across Contexts?” The ANNALS of the American Academy of Political and Social Science 659, no. 1 (May 1, 2015): 122–31. https://doi.org/10.1177/0002716215569441.
- DiMaggio, Paul, Manish Nag, and David Blei. “Exploiting Affinities between Topic Modeling and the Sociological Perspective on Culture: Application to Newspaper Coverage of U.S. Government Arts Funding.” Poetics, Topic Models and the Cultural Sciences, 41, no. 6 (December 2013): 570–606. https://doi.org/10.1016/j.poetic.2013.08.004.
- Egami, Naoki, Christian J Fong, Justin Grimmer, Margaret E Roberts, and Brandon M Stewart. “How to Make Causal Inferences Using Texts,” n.d., 68.
- Gilardi, Fabrizio, Theresa Gessler, Mael Kubli, and Stefan Müller. “Social Media and Political Agenda Setting.” Work in Progress, 2020.
- Grimmer, J. “A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases.” Political Analysis 18, no. 1 (January 1, 2010): 1–35. https://doi.org/10.1093/pan/mpp034.
- Hobbs, William R., and Margaret E. Roberts. “How Sudden Censorship Can Increase Access to Information.” American Political Science Review 112, no. 3 (August 2018): 621–36. https://doi.org/10.1017/S0003055418000084.
- King, Gary, Jennifer Pan, and Margaret E. Roberts. “How the Chinese Government Fabricates Social Media Posts for Strategic Distraction, Not Engaged Argument.” American Political Science Review 111, no. 3 (August 2017): 484–501. https://doi.org/10.1017/S0003055417000144.
- Loughran, Tim, and Bill McDonald. “When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks.” The Journal of Finance 66, no. 1 (2011): 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x.
- Peterson, Andrew, and Arthur Spirling. “Classification Accuracy as a Substantive Quantity of Interest: Measuring Polarization in Westminster Systems.” Political Analysis 26, no. 1 (January 2018): 120–28. https://doi.org/10.1017/pan.2017.39.
- Proksch, Sven-Oliver, Will Lowe, Jens Wäckerle, and Stuart Soroka. “Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches.” Legislative Studies Quarterly. Accessed January 20, 2019. https://doi.org/10.1111/lsq.12218.
- Proksch, Sven-Oliver, and Jonathan B. Slapin. “Parliamentary Questions and Oversight in the European Union.” European Journal of Political Research 50, no. 1 (January 1, 2011): 53–79. https://doi.org/10.1111/j.1475-6765.2010.01919.x.
- Rossiter, Erin. “Measuring Agenda Setting in Interactive Political Communications.” Working paper.
- Schmidt, Benjamin M. “Words Alone: Dismantling Topic Models in the Humanities.” Journal of Digital Humanities, April 5, 2013. http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/.
- Schoonvelde, Martijn, Anna Brosius, Gijs Schumacher, and Bert N. Bakker. “Liberals Lecture, Conservatives Communicate: Analyzing Complexity and Ideology in 381,609 Political Speeches.” PLOS ONE 14, no. 2 (February 6, 2019). https://doi.org/10.1371/journal.pone.0208450.
- Schwemmer, Carsten, and Oliver Wieczorek. “The Methodological Divide of Sociology: Evidence from Two Decades of Journal Publications.” Sociology 54, no. 1 (2020): 3–21.
- Shugars, Sarah. “The Structure of Reasoning: Measuring Justification and Preferences in Text.” Working paper, 26.
- Shugars, Sarah, and Nicholas Beauchamp. “Why Keep Arguing? Predicting Engagement in Political Conversations Online.” SAGE Open 9, no. 1 (January 1, 2019): 2158244019828850. https://doi.org/10.1177/2158244019828850.
- Spirling, Arthur. “Democratization and Linguistic Complexity: The Effect of Franchise Extension on Parliamentary Discourse, 1832–1915.” The Journal of Politics 78, no. 1 (December 17, 2015): 120–36. https://doi.org/10.1086/683612.
- Spirling, Arthur. “U.S. Treaty Making with American Indians: Institutional Change and Relative Power, 1784–1911.” American Journal of Political Science 56, no. 1 (January 1, 2012): 84–97. https://doi.org/10.1111/j.1540-5907.2011.00558.x.
- Terman, Rochelle. “Islamophobia and Media Portrayals of Muslim Women: A Computational Text Analysis of US News Coverage.” International Studies Quarterly 61, no. 3 (September 1, 2017): 489–502. https://doi.org/10.1093/isq/sqx051.
- Watanabe, Kohei, and Yuan Zhou. “Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches.” Social Science Computer Review, February 21, 2020, 0894439320907027. https://doi.org/10.1177/0894439320907027.
- Wiedemann, Gregor. “Proportional Classification Revisited: Automatic Content Analysis of Political Manifestos Using Active Learning.” Social Science Computer Review, February 25, 2018, 0894439318758389. https://doi.org/10.1177/0894439318758389.
More advanced text analysis techniques
- Rodman, Emma. “A Timely Intervention: Tracking the Changing Meanings of Political Concepts with Word Vectors.” Political Analysis 28, no. 1 (2020). https://doi.org/10.1017/pan.2019.23.
- Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115, no. 16 (2018). https://doi.org/10.1073/pnas.1720347115.
- Kozlowski, Austin C., Matt Taddy, and James A. Evans. “The Geometry of Culture: Analyzing the Meanings of Class through Word Embeddings.” American Sociological Review 84, no. 5 (October 1, 2019): 905–49. https://doi.org/10.1177/0003122419877135.
- Rheault, Ludovic, and Christopher Cochrane. “Word Embeddings for the Analysis of Ideological Placement in Parliamentary Corpora.” Political Analysis 28, no. 1 (January 2020): 112–33. https://doi.org/10.1017/pan.2019.26.
- Rodríguez, Pedro L., Arthur Spirling, and Brandon M Stewart. “Embedding Regression: Models for Context-Specific Description and Inference in Political Science,” July 1, 2021. https://github.com/prodriguezsosa/EmbeddingRegression.
- Spirling, Arthur, and Pedro Rodriguez. “Word embeddings: What works, what doesn’t, and how to tell the difference for applied research.” Journal of Politics, forthcoming, 2021.
- Bruns, Axel. “After the ‘APIcalypse’: Social Media Platforms and Their Fight against Critical Scholarly Research.” Information, Communication & Society (July 11, 2019): 1–23. https://doi.org/10.1080/1369118X.2019.1637447.
- Freelon, Deen. “Computational Research in the Post-API Age.” Political Communication 35, no. 4 (October 2, 2018): 665–68. https://doi.org/10.1080/10584609.2018.1477506.
- Halavais, Alexander. “Overcoming Terms of Service: A Proposal for Ethical Distributed Research.” Information, Communication & Society 22, no. 11 (September 19, 2019): 1567–81. https://doi.org/10.1080/1369118X.2019.1627386.
- King, Gary, and Nathaniel Persily. “A New Model for Industry–Academic Partnerships.” PS: Political Science & Politics 53, no. 4 (October 2020): 703–9. https://doi.org/10.1017/S1049096519001021.
- Puschmann, Cornelius. “An End to the Wild West of Social Media Research: A Response to Axel Bruns.” Information, Communication & Society 22, no. 11 (September 19, 2019): 1582–89. https://doi.org/10.1080/1369118X.2019.1646300.
- Proksch, Sven-Oliver, Christopher Wratil, and Jens Wäckerle. “Testing the Validity of Automatic Speech Recognition for Political Text Analysis.” Political Analysis 27, no. 3 (July 2019): 339–59. https://doi.org/10.1017/pan.2018.62.
- Webb Williams, Nora, Andreu Casas, and John D. Wilkerson. Images as Data for Social Science Research: An Introduction to Convolutional Neural Nets for Image Classification. 1st ed. Cambridge University Press, 2020. https://doi.org/10.1017/9781108860741.
Books on other Approaches to Text Analysis
Maybe you found that you do like text analysis but R and/or quanteda are not for you. Here are some recommendations based on different packages or programming languages: