class: center, middle, inverse, title-slide

# Webscraping and the Text Analysis Pipeline
## Computational Text Analysis
### Theresa Gessler
University of Zurich | http://theresagessler.eu/ | @th_ges
### 2022-05-12

---

<style>
pre {
  white-space: pre-wrap;       /* Since CSS 2.1 */
  white-space: -moz-pre-wrap;  /* Mozilla, since 1999 */
  white-space: -pre-wrap;      /* Opera 4-6 */
  white-space: -o-pre-wrap;    /* Opera 7 */
  word-wrap: break-word;       /* Internet Explorer 5.5+ */
}
</style>

<style type="text/css">
.remark-slide-content:after {
  content: "Theresa Gessler, Webscraping";
  font-size: 80%;
  position: absolute;
  bottom: 0px;
  right: 75px;
  height: 40px;
  width: 420px;
}
</style>

# Program

<br><br><br>

**scraping**: /ˈskreɪpɪŋ/, *to remove (an outer layer, for example) from a surface by forceful strokes of an edged or rough instrument*

--

**web scraping**: to collect data from the web by removing the unnecessary parts (sometimes with a rough instrument)

--

→ Project to use this for text analysis

---

# Program

- downloading data from simple HTML pages
    - texts
    - tables
    - iteration
- other scraping scenarios
    - APIs
    - dynamic pages
    - webcrawlers / spiders
    - task scheduling
    - (importing offline data)

---
layout: true

---
class: center, inverse, middle

# Scraping Data from simple pages

---
layout: true

# Simple pages

---

## Scraping

- extracting data from webpages
    - anything from university webpages to social media
- lots of different techniques
    - today: **scraping simple static pages**

--

- types of scraping
    - structured vs. **unstructured data**
        - *scenario 2: APIs*
    - **accessible** vs. protected data
        - *scenario 3: dynamic pages*
    - gathering as diverse information as possible from different pages vs. **very specific scrapers**
        - *scenario 4: web crawling*
    - **one-off scraping** vs. regular data collection
        - *scenario 5: task scheduling*

---

## Scenario 1: Static pages (what we'll cover)

- extracting text, links and tables from many standard webpages
- conditions
    - page written in HTML / XML
    - page has a static URL through which you can reach it

--

## Procedure

- 'parsing' page in R
- extracting relevant parts
- cleaning into usable format (e.g. data frame, raw text, ...)

---

<img src="../diagrams/Folie51.png" width="80%">

---

<img src="../diagrams/Folie52.png" width="80%">

---

## HTML

- **H**yper **T**ext **M**arkup **L**anguage
- *markup*: additional description of formatting beyond the content of the text
    - many exceptions where it is not 'markup'
- language consists of **HTML tags** to specify character / behaviour of text
- HTML tags typically consist of a starting and an end tag (exceptions: images, line breaks etc.)
    - they surround the text they are formatting

Example: `<tagname>Content goes here...</tagname>`

---

<img src="../diagrams/Folie52.png" width="80%">

---

<img src="../diagrams/Folie53.png" width="80%">

---

<img src="../diagrams/Folie54.png" width="80%">

---

<img src="../diagrams/Folie55.png" width="80%">

---

## Webscraping HTML Pages

- collecting data from HTML pages means removing the formatting but keeping any information it contains
    - 'parsing' of page structure
    - 'selecting' of parts of pages

---
layout: true

# HTML

---
class: center, middle, inverse

---

.wide-left[
<img src="../img/quotes2scrape_code.png" width="75%" />
]
.small-right[
### Example HTML Code
]

---

## Basic HTML tags

```
<html>
<head>
<title>Title of your web page</title>
</head>
<body>
HTML web page content
</body>
</html>
```

--

- we are mostly interested in what is inside the **body**, that is, the content of a webpage
- **head** gives meta information, often used by search engines
- tags can be **nested**
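
As a preview (not part of the original deck): once a page like this is parsed in R, the nesting is what lets us pick out head and body content separately. The sketch below uses the `rvest` commands introduced later in these slides on the toy skeleton above.

```r
# Preview sketch: parsing the toy skeleton above with rvest (introduced later)
library(rvest)

toy_html <- "<html><head><title>Title of your web page</title></head>
             <body>HTML web page content</body></html>"

page <- read_html(toy_html)
html_element(page, "title") %>% html_text()  # the meta information in the head
html_element(page, "body") %>% html_text()   # the page content we usually want
```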

---

## Basic HTML Tags: Headings

**Headings** are defined by numbered h tags.

Examples (with code and outcome):

`<h1> your heading</h1>`

`<h2> a smaller heading</h2>`
<h2> a smaller heading</h2>

`<h3> an even smaller heading</h3>`
<h3> an even smaller heading</h3>

`<h4> an even smaller heading</h4>`
<h4> an even smaller heading</h4>

`<h5> an even smaller heading</h5>`
<h5> an even smaller heading</h5>

---

## Basic HTML Tags: Paragraphs

**Paragraphs** are defined by `div` or `p` tags.

Examples:

`<p>this is a paragraph.</p><p>and this is the next.</p>`

<p>this is a paragraph.</p><p>and this is the next.</p>

--

`<div>this is a paragraph.</div><div>and this is the next.</div>`

<div>this is a paragraph.</div><div>and this is the next.</div>

---

## Basic HTML Tags: Attributes

- All HTML elements can have attributes
- Attributes provide additional information about an element
- they are included inside the tag

--

### Usage

- they are always specified in the starting tag
    - e.g. `<title attribute="x"> Title </title>`
- Attributes usually come in name and value pairs
    - e.g. attributename="attributevalue"

---

## Basic HTML Tags: Attributes - Links

- Most common case of attributes: **links**
- text or images turned into a link by surrounding `<a>` tag (*anchor*)
- link address specified as href attribute (*hypertext reference*)

Example:

`This is text <a href="http://quotes.toscrape.com/">with a link</a>.`

This is text <a href="http://quotes.toscrape.com/">with a link</a>.

---

## Basic HTML Tags: Attributes

- other examples of attributes
    - src: location of an image
    - style: formatting
    - class: formatting for groups

Examples:

`<img src="myimage.jpg">`

`<p style="color:red">This is a paragraph.</p>`

<p style="color:red">This is a paragraph.</p>

`<p class="error">Red highlight</p>`

---

## Styling with Classes

Webpages like blogs often define **styles** and apply them to classes across the whole webpage. This use of classes is very common because it reduces the risk of accidentally formatting one instance of a repeated element differently.

```
<style>
p.error {
  color: red;
  border: 1px solid red;
}
</style>
<p class="error">Red highlight</p>
```

<style>
p.error {
  color: red;
  border: 1px solid red;
}
</style>
<p class="error">Red highlight</p>

---

.wide-left[
<img src="../img/quotes2scrape_code.png" width="75%" />
]
.small-right[
Have another look at the webpage - do you understand more now?
]

---

## rvest: The Swiss army knife of scraping

.left-column[<img src="img/rvest.png">]
.right-column[
- R package for scraping
- **strengths**
    - covers most frequent use cases
    - integration with other packages, e.g. the tidyverse
- **weaknesses**
    - relatively simple: no dynamic webpages
]

--

<br> <br> <br> <br> <br> <br>

### Main uses

- Tables
- Texts
- extracting links

---

## Overview of rvest commands

Limited set of commands:

- `read_html()`
- `html_elements()` (`html_element()` for only the first match)
- `html_text()`
- `html_table()`
- `html_attrs()` (`html_attr()` for a single named attribute)
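
A short sketch, not part of the original slides, of how these commands fit together; the HTML string, class and selectors below are made up for illustration:

```r
# Made-up mini page to demonstrate the commands listed above
library(rvest)

page <- read_html('<div class="quote">A quote</div>
                   <a href="page2.html">next page</a>
                   <table><tr><th>year</th></tr><tr><td>2022</td></tr></table>')

html_elements(page, ".quote") %>% html_text()   # all matches of the class: "A quote"
html_element(page, "a") %>% html_attr("href")   # first link's address: "page2.html"
html_element(page, "table") %>% html_table()    # the table as a tibble / data frame
```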

---

## The American Presidency Project

[https://www.presidency.ucsb.edu/](https://www.presidency.ucsb.edu/)

<img src="img/presidency.jpg" width="60%">

---

## 'Parsing' HTML

- Example: [https://www.presidency.ucsb.edu/](https://www.presidency.ucsb.edu/)

--

- scraping with R
    - you manually specify a resource
    - R sends request to server that hosts website
    - server returns resource
    - R parses HTML (i.e., interprets the structure), but does not render it in a nice fashion
    - you tell R which parts of the structure to focus on and what to extract

--

```r
library(rvest)
url <- "https://www.presidency.ucsb.edu/"
page <- read_html(url) # returns parsed page
```

--

We practice this together in R

---

## The process of scraping

<img src="diagrams/Folie11.png">

---

## The process of scraping

<img src="diagrams/Folie12.png">

---

## CSS Selectors

- we use the *appearance* / *style* of text to select specific parts
- based on specific **HTML elements**
    - tags
    - classes
    - attributes
- CSS selectors provide a *language* in which we can describe what we select at a more abstract level

--

<style>
p.testclass {
  color: red;
  border: 1px solid red;
}
</style>
<p class="testclass">text to be selected</p>

--

**"text to be selected"** ← vs. → **text in red, surrounded by a red border**

---

## Basic selectors

--

<table border=0 width="100%">
<tr> <td width="20%">*</td> <td width="15%">universal selector</td> <td width="55%">Matches everything.</td> <td width="10%">*</td> </tr>
--
<tr> <td>element</td> <td>element / type selector</td> <td>Matches an element</td> <td>p</td> </tr>
--
<tr> <td>[attribute]</td> <td>attribute selector</td> <td>Matches elements containing a given attribute</td> <td>[href]</td> </tr>
<tr> <td>[attribute=value]</td> <td>attribute selector</td> <td>Matches elements containing a given attribute with a given value</td> <td>[href="/"]</td> </tr>
--
<tr> <td>.class</td> <td>class selector</td> <td>Matches the value of a class attribute</td> <td>.header</td> </tr>
--
<tr> <td>#id</td> <td>ID selector</td> <td>Matches the value of an id attribute</td> <td>#first</td> </tr>
</table>

---

## More complex attribute selectors

<table border=0 width="100%">
<tr> <td>[attribute*=value]</td> <td>Matches elements with an attribute that contains a given value</td> <td>a[href*="pressrelease"]</td> </tr>
<tr> <td>[attribute^="value"]</td> <td>Matches elements with an attribute that starts with a given value</td> <td>a[href^="/press/"]</td> </tr>
<tr> <td>[attribute$="value"]</td> <td>Matches elements with an attribute that ends with a given value</td> <td>[href$=".pdf"]</td> </tr>
</table>

---

## Combining CSS Selectors

There are several ways to combine CSS Selectors:

<table border=0 width="100%">
<tr><td>element,element</td> <td>Selects all &lt;div&gt; elements and all &lt;p&gt; elements</td> <td>div, p</td></tr>
<tr><td>element element</td> <td>Selects all &lt;p&gt; elements inside &lt;div&gt; elements</td> <td>div p</td></tr>
<tr><td>element>element</td> <td>Selects all &lt;p&gt; elements whose parent is a &lt;div&gt; element</td> <td>div > p</td></tr>
<tr><td>element+element</td> <td>Selects all &lt;p&gt; elements placed immediately after &lt;div&gt; elements</td> <td>div + p</td></tr>
<tr><td>element1~element2</td> <td>Selects every &lt;ul&gt; element that is preceded by a &lt;p&gt; element</td> <td>p ~ ul</td></tr>
</table>
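
In `rvest`, any of these selectors can be passed straight to `html_elements()`. A sketch with made-up selectors, assuming `page` is a parsed page as in the earlier example:

```r
# Sketch: CSS selectors inside rvest calls (the selectors are illustrative)
html_elements(page, "p")                 # type selector: all <p> elements
html_elements(page, ".header")           # class selector
html_elements(page, "a[href$='.pdf']")   # links whose address ends in .pdf
html_elements(page, "div.quote > span")  # <span> directly inside <div class="quote">
```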

--

Dine at the [CSS Diner](https://flukeout.github.io/). And [use SelectorGadget](https://selectorgadget.com/)

---

## SelectorGadget

- [SelectorGadget](https://selectorgadget.com/)
- [Vignette](https://rvest.tidyverse.org/articles/selectorgadget.html)

<img src="img/selectorgadget.jpg">

---
layout: true

---
class: inverse, center, middle

# Extracting links from webpages

---
layout: true

# Extracting links

---

## Extracting links from webpages

- unlike a book that we read cover to cover, webpages distribute information over multiple pages
- *hyperlinks* connect one page to the others → we follow them by clicking
- we need to deal with this differently when scraping

---

## The process of scraping

<img src="diagrams/Folie12.png">

---

## The process of scraping

<img src="diagrams/Folie13.png">

---

## The process of scraping

<img src="diagrams/Folie14.png">

---

## Links in HTML

- We discussed links as a common case of **attributes**
- text (or other content) turned into a link with `<a>` tag (*anchor*)
- link address specified as href attribute (*hypertext reference*)

--

`This is text <a href="https://www.presidency.ucsb.edu/">with a link</a>.`

This is text <a href="https://www.presidency.ucsb.edu/">with a link</a>.

---

### Extracting links with rvest

- extracting the text of the link
    - `html_elements(page,"a") %>% html_text()`
- extracting the attribute of the link (the hypertext reference)
    - `html_elements(page,"a") %>% html_attr("href")`

--

```r
page <- read_html('This is text <a href="https://www.presidency.ucsb.edu/">with a link</a>.')
html_elements(page,"a") %>% html_text()
```

```
## [1] "with a link"
```

```r
html_elements(page,"a") %>% html_attr("href")
```

```
## [1] "https://www.presidency.ucsb.edu/"
```

--

**Caution**: the link is an attribute of the `<a>` tag!

`html_attr(page,"href")` **vs.** `html_elements(page,"a") %>% html_attr("href")`
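
One practical follow-up (a sketch, not part of the original example): on a real page, extracted `href` values are often relative, so it helps to turn them into full URLs and keep only the ones you need. The "briefing" pattern below is just an illustration.

```r
# Sketch: tidy up extracted links before scraping them
page  <- read_html("https://www.presidency.ucsb.edu/")
links <- html_elements(page, "a") %>% html_attr("href")
links <- links[!is.na(links)]                                   # drop anchors without href
links <- xml2::url_absolute(links, "https://www.presidency.ucsb.edu/")  # relative -> absolute
briefing_links <- links[grepl("briefing", links)]               # keep only links matching a pattern
```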

---
layout: true

---
class: inverse, center, middle

# Automation

---
layout: true

# Automation

---

<img src="diagrams/Folie2.png">

---

<img src="diagrams/Folie3.png">

--

→ How do we get from one page to multiple?

---

## Automation

- repetition of code across different units

--

→ `for` loops

--

→ `apply()` with functions

---

## `for` loops

<img src="diagrams/Folie6.png" width="70%">

---

## `for` loops

<img src="diagrams/Folie7.png" width="70%">

`for (i in VECTOR){ do something with i }`

`for (i in 1:2){ print(i) }`

---

## `for` loops

### Example

```r
texts <- character(length(urls))  # pre-allocate the result vector
for (i in seq_along(urls)){
  texts[i] <- read_html(urls[i]) %>%
    html_element(".text") %>%
    html_text()
}
```

---

## `for` loops

### Advantages

- easy to write
- do not require full translation of code into functions
- easy to interrupt and continue for prolonged scraping

--

### Disadvantages

- become inefficient for high numbers of iterations
- no swag: [stackoverflow: Are For loops evil in R?](https://stackoverflow.com/questions/30240573/are-for-loops-evil-in-r)

--

### Good to know

- loops with `for` are just the most well-known type of loop
    - `while` loops, `repeat` loops, `break` and `next` clauses

---

## Alternative: `sapply()`

- Alternatively, we can define a [**function**](https://swcarpentry.github.io/r-novice-inflammation/02-func-R/) with scraping code

```r
scrape_text <- function(url){
  page <- read_html(url)
  page %>%
    html_element(".text") %>%
    html_text()
}
```

- we use `sapply()` to apply the function to multiple URLs (see: [apply commands](https://www.guru99.com/r-apply-sapply-tapply.html))

```r
texts <- sapply(urls, scrape_text)
```

---

## Sequence

<img src="diagrams/Folie1.png" width="70%">

---
layout: true

---
class: inverse, center, middle

# Scraping Scenarios

---
layout: true

# Scenarios

---

## Scenario 1: Static pages (what we covered)

- extracting text, links and tables from many standard webpages
    - page written in HTML / XML
    - page has a static URL through which you can reach it

## Procedure

- 'parsing' page in R
- extracting relevant parts
- cleaning into usable format (e.g. data frame, raw text, ...)

---

## Scenario 2: APIs

- companies and governments often provide **application programming interfaces** for their data
    - increasing accessibility, reliability
    - used for scraping and interaction with apps

--

### Differences to static pages

- structured data in specific notation (often JSON)
- access through sending requests
    - e.g. with `httr` package
- specified regulations on extent and volume of access
- in many cases: R packages for access to API
    - e.g. `gender`, `rtweet`, `WikipediR`, `tuber`, ...

---

## Scenario 2: APIs

- pro
    - legal and robust to changes in webpage structure
    - highly standardized
- con
    - not every page has an API
    - rate limits / restrictions on the amount of data
    - may be terminated: ['post API age'](https://www.tandfonline.com/doi/full/10.1080/10584609.2018.1477506)
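
Without a dedicated wrapper package, an API call usually boils down to sending a request and parsing the JSON response. A generic sketch: the endpoint and query parameters below are placeholders, not a real API.

```r
# Sketch of a raw API request with httr (placeholder endpoint and parameters)
library(httr)
library(jsonlite)

resp <- GET("https://api.example.org/v1/documents",
            query = list(q = "president", per_page = 50))
stop_for_status(resp)                                    # fail loudly on HTTP errors
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
```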

---

## Scenario 3: dynamic pages

- for scraping pages that change while you are on them without changing their URL
    - e.g. Buzzfeed, (many) search functions, pages without permanent URL

--

### Differences to static pages

- simulates web browsing rather than parsing static page
- navigation & scraping through commands to automated browser
    - primarily with `RSelenium` package

--

- pro
    - get around many restrictions to scraping
    - possibility to automate browsing
- con
    - difficult to set up
    - less robust than static scraping

---

## Scenario 4: web crawling / spiders

- parsing of massive amounts of data
    - e.g. price data, building a search engine, ...
- parsing of pages e.g. through `boilerpipeR`

--

### Differences to static pages

- no selection of specific parts but use of *heuristics* on HTML code
    - → less exact but less labor-intensive extraction of content

--

- pro
    - masses of data
- con
    - masses of data (that are unclear)

---

## Scenario 5: Task scheduling

- collecting data over time requires **regular updates**, e.g.
    - scraping daily front page news
    - updating a Corona infographic automatically with new cantonal data
- task schedulers help us to create **automatic background tasks** so we do not need to manually execute the script at regular intervals

--

### Scheduling R tasks

- [taskscheduleR](https://cran.r-project.org/web/packages/taskscheduleR/vignettes/taskscheduleR.html) or [cronR](https://cran.r-project.org/web/packages/cronR/vignettes/cronR.html)
- or the [scheduler of your operating system](https://stackoverflow.com/questions/2793389/scheduling-r-script)
- [youtube explainer](https://www.youtube.com/watch?v=ETu_xvOG_0k)

---

## Scenario 6: Importing offline data

- `readtext`: R package to read text from documents into R (and quanteda)
    - [Documentation](https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html)
- from **single files, folders, URLs, zips**
- works for plain text files **(.txt)**, JavaScript Object Notation **(.json)**, comma- or tab-separated values **(.csv, .tab, .tsv)**, XML documents **(.xml)**, PDF **(.pdf)**, Microsoft Word formatted files **(.doc, .docx)**

--

```r
# all Word documents
texts <- readtext("*.docx")
# all pdf documents in the slides folder
docs <- readtext(file="slides/*.pdf")
# all documents from a zip
docs <- readtext(file="solutions/solutions.zip")
```

--

→ check the IMF example script for bulk downloading PDFs and other files!

---
layout: true

---

## Homework

- **06_scraping.rmd, 06_singlefile.rmd, 06_scraping_briefings.rmd**
- try `readtext` to read in some PDFs, e.g. your papers, your latest reading list, ...
- optional: dine at the [CSS diner](https://flukeout.github.io/)

--

### Building on the course

- find and (if possible) gather some **text data you want to use**
    - web content
    - a dataset
    - text documents
    - ...try things out!

---
layout: true

---
class: inverse, center, middle

# Thank you! Questions?