class: center, middle, inverse, title-slide

# Webscraping and the Text Analysis Pipeline
## Computational Text Analysis
### Theresa Gessler
University of Zurich | http://theresagessler.eu/ | @th_ges
### 2022-05-12

---

<style>
pre {
  white-space: pre-wrap;       /* Since CSS 2.1 */
  white-space: -moz-pre-wrap;  /* Mozilla, since 1999 */
  white-space: -pre-wrap;      /* Opera 4-6 */
  white-space: -o-pre-wrap;    /* Opera 7 */
  word-wrap: break-word;       /* Internet Explorer 5.5+ */
}
</style>

<style type="text/css">
.remark-slide-content:after {
  content: "Theresa Gessler, Webscraping";
  font-size: 80%;
  position: absolute;
  bottom: 0px;
  right: 75px;
  height: 40px;
  width: 420px;
}
</style>

# Program

<br><br><br>

**scraping**: /ˈskreɪpɪŋ/, *to remove (an outer layer, for example) from a surface by forceful strokes of an edged or rough instrument*

--

**web scraping**: to collect data from the web by removing the unnecessary parts (sometimes with a rough instrument)

--

→ Project to use this for text analysis

---

# Program

- downloading data from simple HTML pages
    - texts
    - tables
    - iteration
- other scraping scenarios
    - APIs
    - dynamic pages
    - webcrawlers / spiders
    - task scheduling
    - (importing offline data)

---
layout: true

---
class: center, inverse, middle

# Scraping Data from simple pages

---
layout: true

# Simple pages

---

## Scraping

- extracting data from webpages
    - anything from university webpages to social media
- lots of different techniques
    - today: **scraping simple static pages**

--

- types of scraping
    - structured vs. **unstructured data**
        - *scenario 2: APIs*
    - **accessible** vs. protected data
        - *scenario 3: dynamic pages*
    - gathering as diverse information as possible from different pages vs. **very specific scrapers**
        - *scenario 4: web crawling*
    - **one-off scraping** vs. regular data collection
        - *scenario 5: task scheduling*

---

## Scenario 1: Static pages (what we'll cover)

- extracting text, links and tables from many standard webpages
- conditions
    - page written in HTML / XML
    - page has a static URL through which you can reach it

--

## Procedure

- 'parsing' page in R
- extracting relevant parts
- cleaning into usable format (e.g. data frame, raw text, ...)

---

<img src="../diagrams/Folie51.png" width="80%">

---

<img src="../diagrams/Folie52.png" width="80%">

---

## HTML

- **H**yper **T**ext **M**arkup **L**anguage
- *markup*: additional description of formatting beyond the content of the text
    - many exceptions where it is not 'markup'
- language consists of **HTML tags** to specify character / behaviour of text
- HTML tags typically consist of a starting and an end tag (exceptions: images, line breaks etc.)
    - they surround the text they are formatting

Example: `<tagname>Content goes here...</tagname>`

---

<img src="../diagrams/Folie52.png" width="80%">

---

<img src="../diagrams/Folie53.png" width="80%">

---

<img src="../diagrams/Folie54.png" width="80%">

---

<img src="../diagrams/Folie55.png" width="80%">

---

## Webscraping HTML Pages

- collecting data from HTML pages means removing the formatting but keeping any information it contains
    - 'parsing' of page structure
    - 'selecting' of parts of pages

---
layout: true

# HTML

---
class: center, middle, inverse

---

.wide-left[
<img src="../img/quotes2scrape_code.png" width="75%" />
]
.small-right[
### Example HTML Code
]

---

## Basic HTML tags

```
<html>
<head>
<title>Title of your web page</title>
</head>
<body>
HTML web page content
</body>
</html>
```

--

- we are mostly interested in what is inside the **body**, that is, the content of a webpage
- **head** gives meta information, often used by search engines
- tags can be **nested**
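
As a preview (not part of the original deck): once a page like this is parsed in R, the nesting is what lets us pick out head and body content separately. The sketch below uses the `rvest` commands introduced later in these slides on the toy skeleton above.

```r
# Preview sketch: parsing the toy skeleton above with rvest (introduced later)
library(rvest)

toy_html <- "<html><head><title>Title of your web page</title></head>
             <body>HTML web page content</body></html>"

page <- read_html(toy_html)
html_element(page, "title") %>% html_text()  # the meta information in the head
html_element(page, "body") %>% html_text()   # the page content we usually want
```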

---

## Basic HTML Tags: Headings

**Headings** are defined by numbered h tags.

Examples (with code and outcome):

`<h1> your heading</h1>`

`<h2> a smaller heading</h2>`
<h2> a smaller heading</h2>

`<h3> an even smaller heading</h3>`
<h3> an even smaller heading</h3>

`<h4> an even smaller heading</h4>`
<h4> an even smaller heading</h4>

`<h5> an even smaller heading</h5>`
<h5> an even smaller heading</h5>

---

## Basic HTML Tags: Paragraphs

**Paragraphs** are defined by `div` or `p` tags.

Examples:

`<p>this is a paragraph.</p><p>and this is the next.</p>`

<p>this is a paragraph.</p><p>and this is the next.</p>

--

`<div>this is a paragraph.</div><div>and this is the next.</div>`

<div>this is a paragraph.</div><div>and this is the next.</div>

---

## Basic HTML Tags: Attributes

- All HTML elements can have attributes
- Attributes provide additional information about an element
- they are included inside the tag

--

### Usage

- they are always specified in the starting tag
    - e.g. `<title attribute="x"> Title </title>`
- Attributes usually come in name and value pairs
    - e.g. attributename="attributevalue"

---

## Basic HTML Tags: Attributes - Links

- Most common case of attributes: **links**
- text or images turned into a link by surrounding `<a>` tag (*anchor*)
- link address specified as href attribute (*hypertext reference*)

Example:

`This is text <a href="http://quotes.toscrape.com/">with a link</a>.`

This is text <a href="http://quotes.toscrape.com/">with a link</a>.

---

## Basic HTML Tags: Attributes

- other examples of attributes
    - src: location of an image
    - style: formatting
    - class: formatting for groups

Examples:

`<img src="myimage.jpg">`

`<p style="color:red">This is a paragraph.</p>`

<p style="color:red">This is a paragraph.</p>

`<p class="error">Red highlight</p>`

---

## Styling with Classes

Webpages like blogs often define **styles** and apply them to classes across the whole webpage. This use of classes is very common because it reduces the risk of accidentally formatting one instance of a repeated element differently.

```
<style>
p.error {
  color: red;
  border: 1px solid red;
}
</style>
<p class="error">Red highlight</p>
```

<style>
p.error {
  color: red;
  border: 1px solid red;
}
</style>
<p class="error">Red highlight</p>

---

.wide-left[
<img src="../img/quotes2scrape_code.png" width="75%" />
]
.small-right[
Have another look at the webpage - do you understand more now?
]

---

## rvest: The Swiss army knife of scraping

.left-column[<img src="img/rvest.png">]
.right-column[
- R package for scraping
- **strengths**
    - covers most frequent use cases
    - integration with other packages, e.g. the tidyverse
- **weaknesses**
    - relatively simple: no dynamic webpages
]

--

<br> <br> <br> <br> <br> <br>

### Main uses

- Tables
- Texts
- extracting links

---

## Overview of rvest commands

Limited set of commands:

- `read_html()`
- `html_elements()` (`html_element()` for only the first match)
- `html_text()`
- `html_table()`
- `html_attrs()` (`html_attr()` for a single named attribute)
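
A short sketch, not part of the original slides, of how these commands fit together; the HTML string, class and selectors below are made up for illustration:

```r
# Made-up mini page to demonstrate the commands listed above
library(rvest)

page <- read_html('<div class="quote">A quote</div>
                   <a href="page2.html">next page</a>
                   <table><tr><th>year</th></tr><tr><td>2022</td></tr></table>')

html_elements(page, ".quote") %>% html_text()   # all matches of the class: "A quote"
html_element(page, "a") %>% html_attr("href")   # first link's address: "page2.html"
html_element(page, "table") %>% html_table()    # the table as a tibble / data frame
```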

---

## The American Presidency Project

[https://www.presidency.ucsb.edu/](https://www.presidency.ucsb.edu/)

<img src="img/presidency.jpg" width="60%">

---

## 'Parsing' HTML

- Example: [https://www.presidency.ucsb.edu/](https://www.presidency.ucsb.edu/)

--

- scraping with R
    - you manually specify a resource
    - R sends request to server that hosts website
    - server returns resource
    - R parses HTML (i.e., interprets the structure), but does not render it in a nice fashion
    - you tell R which parts of the structure to focus on and what to extract

--

```r
library(rvest)
url <- "https://www.presidency.ucsb.edu/"
page <- read_html(url) # returns parsed page
```

--

We practice this together in R

---

## The process of scraping

<img src="diagrams/Folie11.png">

---

## The process of scraping

<img src="diagrams/Folie12.png">

---

## CSS Selectors

- we use the *appearance* / *style* of text to select specific parts
- based on specific **HTML elements**
    - tags
    - classes
    - attributes
- CSS selectors provide a *language* in which we can describe what we select at a more abstract level

--

<style>
p.testclass {
  color: red;
  border: 1px solid red;
}
</style>
<p class="testclass">text to be selected</p>

--

**"text to be selected"** ← vs. → **text in red, surrounded by a red border**

---

## Basic selectors

--

<table border=0 width="100%">
<tr> <td width="20%">*</td> <td width="15%">universal selector</td> <td width="55%">Matches everything.</td> <td width="10%">*</td> </tr>
--
<tr> <td>element</td> <td>element / type selector</td> <td>Matches an element</td> <td>p</td> </tr>
--
<tr> <td>[attribute]</td> <td>attribute selector</td> <td>Matches elements containing a given attribute</td> <td>[href]</td> </tr>
<tr> <td>[attribute=value]</td> <td>attribute selector</td> <td>Matches elements containing a given attribute with a given value</td> <td>[href="/"]</td> </tr>
--
<tr> <td>.class</td> <td>class selector</td> <td>Matches the value of a class attribute</td> <td>.header</td> </tr>
--
<tr> <td>#id</td> <td>ID selector</td> <td>Matches the value of an id attribute</td> <td>#first</td> </tr>
</table>

---

## More complex attribute selectors

<table border=0 width="100%">
<tr> <td>[attribute*=value]</td> <td>Matches elements with an attribute that contains a given value</td> <td>a[href*="pressrelease"]</td> </tr>
<tr> <td>[attribute^="value"]</td> <td>Matches elements with an attribute that starts with a given value</td> <td>a[href^="/press/"]</td> </tr>
<tr> <td>[attribute$="value"]</td> <td>Matches elements with an attribute that ends with a given value</td> <td>[href$=".pdf"]</td> </tr>
</table>

---

## Combining CSS Selectors

There are several ways to combine CSS Selectors:

<table border=0 width="100%">
<tr><td>element,element</td> <td>Selects all &lt;div&gt; elements and all &lt;p&gt; elements</td> <td>div, p</td></tr>
<tr><td>element element</td> <td>Selects all &lt;p&gt; elements inside &lt;div&gt; elements</td> <td>div p</td></tr>
<tr><td>element>element</td> <td>Selects all &lt;p&gt; elements whose parent is a &lt;div&gt; element</td> <td>div > p</td></tr>
<tr><td>element+element</td> <td>Selects all &lt;p&gt; elements placed immediately after &lt;div&gt; elements</td> <td>div + p</td></tr>
<tr><td>element1~element2</td> <td>Selects every &lt;ul&gt; element that is preceded by a &lt;p&gt; element</td> <td>p ~ ul</td></tr>
</table>
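
In `rvest`, any of these selectors can be passed straight to `html_elements()`. A sketch with made-up selectors, assuming `page` is a parsed page as in the earlier example:

```r
# Sketch: CSS selectors inside rvest calls (the selectors are illustrative)
html_elements(page, "p")                 # type selector: all <p> elements
html_elements(page, ".header")           # class selector
html_elements(page, "a[href$='.pdf']")   # links whose address ends in .pdf
html_elements(page, "div.quote > span")  # <span> directly inside <div class="quote">
```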

--

Dine at the [CSS Diner](https://flukeout.github.io/). And [use SelectorGadget](https://selectorgadget.com/)

---

## SelectorGadget

- [SelectorGadget](https://selectorgadget.com/)
- [Vignette](https://rvest.tidyverse.org/articles/selectorgadget.html)

<img src="img/selectorgadget.jpg">

---
layout: true

---
class: inverse, center, middle

# Extracting links from webpages

---
layout: true

# Extracting links

---

## Extracting links from webpages

- unlike a book that we read cover to cover, webpages distribute information over multiple pages
- *hyperlinks* connect one page to the others → we follow them by clicking
- we need to deal with this differently when scraping

---

## The process of scraping

<img src="diagrams/Folie12.png">

---

## The process of scraping

<img src="diagrams/Folie13.png">

---

## The process of scraping

<img src="diagrams/Folie14.png">

---

## Links in HTML

- We discussed links as a common case of **attributes**
- text (or other content) turned into a link with `<a>` tag (*anchor*)
- link address specified as href attribute (*hypertext reference*)

--

`This is text <a href="https://www.presidency.ucsb.edu/">with a link</a>.`

This is text <a href="https://www.presidency.ucsb.edu/">with a link</a>.

---

### Extracting links with rvest

- extracting the text of the link
    - `html_elements(page,"a") %>% html_text()`
- extracting the attribute of the link (the hypertext reference)
    - `html_elements(page,"a") %>% html_attr("href")`

--

```r
page <- read_html('This is text <a href="https://www.presidency.ucsb.edu/">with a link</a>.')
html_elements(page,"a") %>% html_text()
```

```
## [1] "with a link"
```

```r
html_elements(page,"a") %>% html_attr("href")
```

```
## [1] "https://www.presidency.ucsb.edu/"
```

--

**Caution**: the link is an attribute of the `<a>` tag!

`html_attr(page,"href")` **vs.** `html_elements(page,"a") %>% html_attr("href")`
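
One practical follow-up (a sketch, not part of the original example): on a real page, extracted `href` values are often relative, so it helps to turn them into full URLs and keep only the ones you need. The "briefing" pattern below is just an illustration.

```r
# Sketch: tidy up extracted links before scraping them
page  <- read_html("https://www.presidency.ucsb.edu/")
links <- html_elements(page, "a") %>% html_attr("href")
links <- links[!is.na(links)]                                   # drop anchors without href
links <- xml2::url_absolute(links, "https://www.presidency.ucsb.edu/")  # relative -> absolute
briefing_links <- links[grepl("briefing", links)]               # keep only links matching a pattern
```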

---
layout: true

---
class: inverse, center, middle

# Automation

---
layout: true

# Automation

---

<img src="diagrams/Folie2.png">

---

<img src="diagrams/Folie3.png">

--

→ How do we get from one page to multiple?

---

## Automation

- repetition of code across different units

--

→ `for` loops

--

→ `apply()` with functions

---

## `for` loops

<img src="diagrams/Folie6.png" width="70%">

---

## `for` loops

<img src="diagrams/Folie7.png" width="70%">

`for (i in VECTOR){ do something with i }`

`for (i in 1:2){ print(i) }`

---

## `for` loops

### Example

```r
texts <- character(length(urls))  # pre-allocate the result vector
for (i in seq_along(urls)){
  texts[i] <- read_html(urls[i]) %>%
    html_element(".text") %>%
    html_text()
}
```

---

## `for` loops

### Advantages

- easy to write
- do not require full translation of code into functions
- easy to interrupt and continue for prolonged scraping

--

### Disadvantages

- become inefficient for high numbers of iterations
- no swag: [stackoverflow: Are For loops evil in R?](https://stackoverflow.com/questions/30240573/are-for-loops-evil-in-r)

--

### Good to know

- loops with `for` are just the most well-known type of loop
    - `while` loops, `repeat` loops, `break` and `next` clauses

---

## Alternative: `sapply()`

- Alternatively, we can define a [**function**](https://swcarpentry.github.io/r-novice-inflammation/02-func-R/) with scraping code

```r
scrape_text <- function(url){
  page <- read_html(url)
  page %>%
    html_element(".text") %>%
    html_text()
}
```

- we use `sapply()` to apply the function to multiple URLs (see: [apply commands](https://www.guru99.com/r-apply-sapply-tapply.html))

```r
texts <- sapply(urls, scrape_text)
```

---

## Sequence

<img src="diagrams/Folie1.png" width="70%">

---
layout: true

---
class: inverse, center, middle

# Scraping Scenarios

---
layout: true

# Scenarios

---

## Scenario 1: Static pages (what we covered)

- extracting text, links and tables from many standard webpages
    - page written in HTML / XML
    - page has a static URL through which you can reach it

## Procedure

- 'parsing' page in R
- extracting relevant parts
- cleaning into usable format (e.g. data frame, raw text, ...)

---

## Scenario 2: APIs

- companies and governments often provide **application programming interfaces** for their data
    - increasing accessibility, reliability
    - used for scraping and interaction with apps

--

### Differences to static pages

- structured data in specific notation (often JSON)
- access through sending requests
    - e.g. with `httr` package
- specified regulations on extent and volume of access
- in many cases: R packages for access to API
    - e.g. `gender`, `rtweet`, `WikipediR`, `tuber`, ...

---

## Scenario 2: APIs

- pro
    - legal and robust to changes in webpage structure
    - highly standardized
- con
    - not every page has an API
    - rate limits / restrictions on the amount of data
    - may be terminated: ['post API age'](https://www.tandfonline.com/doi/full/10.1080/10584609.2018.1477506)
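
Without a dedicated wrapper package, an API call usually boils down to sending a request and parsing the JSON response. A generic sketch: the endpoint and query parameters below are placeholders, not a real API.

```r
# Sketch of a raw API request with httr (placeholder endpoint and parameters)
library(httr)
library(jsonlite)

resp <- GET("https://api.example.org/v1/documents",
            query = list(q = "president", per_page = 50))
stop_for_status(resp)                                    # fail loudly on HTTP errors
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
```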

---

## Scenario 3: dynamic pages

- for scraping pages that change while you are on them without changing their URL
    - e.g. Buzzfeed, (many) search functions, pages without permanent URL

--

### Differences to static pages

- simulates web browsing rather than parsing static page
- navigation & scraping through commands to automated browser
    - primarily with `RSelenium` package

--

- pro
    - get around many restrictions to scraping
    - possibility to automate browsing
- con
    - difficult to set up
    - less robust than static scraping

---

## Scenario 4: web crawling / spiders

- parsing of massive amounts of data
    - e.g. price data, building a search engine, ...
- parsing of pages e.g. through `boilerpipeR`

--

### Differences to static pages

- no selection of specific parts but use of *heuristics* on HTML code
    - → less exact but less labor-intensive extraction of content

--

- pro
    - masses of data
- con
    - masses of data (that are unclear)

---

## Scenario 5: Task scheduling

- collecting data over time requires **regular updates**, e.g.
    - scraping daily front page news
    - updating a Corona infographic automatically with new cantonal data
- task schedulers help us to create **automatic background tasks** so we do not need to manually execute the script at regular intervals

--

### Scheduling R tasks

- [taskscheduleR](https://cran.r-project.org/web/packages/taskscheduleR/vignettes/taskscheduleR.html) or [cronR](https://cran.r-project.org/web/packages/cronR/vignettes/cronR.html)
- or the [scheduler of your operating system](https://stackoverflow.com/questions/2793389/scheduling-r-script)
- [youtube explainer](https://www.youtube.com/watch?v=ETu_xvOG_0k)

---

## Scenario 6: Importing offline data

- `readtext`: R package to read text from documents into R (and quanteda)
    - [Documentation](https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html)
- from **single files, folders, URLs, zips**
- works for plain text files **(.txt)**, JavaScript Object Notation **(.json)**, comma- or tab-separated values **(.csv, .tab, .tsv)**, XML documents **(.xml)**, PDF **(.pdf)**, Microsoft Word formatted files **(.doc, .docx)**

--

```r
# all Word documents
texts <- readtext("*.docx")
# all pdf documents in the slides folder
docs <- readtext(file="slides/*.pdf")
# all documents from a zip
docs <- readtext(file="solutions/solutions.zip")
```

--

→ check the IMF example script for bulk downloading PDFs and other files!

---
layout: true

---

## Homework

- **06_scraping.rmd, 06_singlefile.rmd, 06_scraping_briefings.rmd**
- try `readtext` to read in some PDFs, e.g. your papers, your latest reading list, ...
- optional: dine at the [CSS diner](https://flukeout.github.io/)

--

### Building on the course

- find and (if possible) gather some **text data you want to use**
    - web content
    - a dataset
    - text documents
    - ...try things out!

---
layout: true

---
class: inverse, center, middle

# Thank you! Questions?