We collect a type of text document that offers many insights to economists: International Monetary Fund (IMF) Country Reports. These reports cover economic and financial developments and trends in member countries. Each report is prepared by a staff team after discussions with officials of the country. We will collect them from the IMF website.
First, use the search page of the IMF to find a suitable and interesting set of country reports. Filter, for example, by a year, a keyword, …
Look at the list of results. Look at the URL for several pages: what changes when you move from page to page?
Can you identify a rule for constructing the URLs of all pages relevant to you (e.g. all pages of results)?
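library(rvest)   # read_html(), html_nodes(), html_attr(), html_text(), %>%
library(dplyr)   # mutate()
library(stringr) # str_remove()
base_url <- "https://www.imf.org/en/Publications/Search?series=IMF%20Staff%20Country%20Reports&when=After&subject=Germany&page="
pages <- 1:7
results_urls <- paste0(base_url, pages)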
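The URL rule can also be wrapped in a small function, so that the same pattern can be reused for another search term. This is only a sketch: `build_results_urls()` is a hypothetical helper, and it assumes that the search page keeps the query pattern shown above for other subjects, which you should check in your browser.

```r
# Hypothetical helper: build the results-page URLs for a subject.
# Assumption: only the `subject` and `page` parameters change between searches.
build_results_urls <- function(subject, n_pages) {
  base <- "https://www.imf.org/en/Publications/Search?series=IMF%20Staff%20Country%20Reports&when=After&subject="
  # URLencode() escapes spaces and other special characters in the subject
  paste0(base, URLencode(subject, reserved = TRUE), "&page=", seq_len(n_pages))
}

urls <- build_results_urls("Germany", 7)
length(urls)  # one URL per results page
```

For the subject "Germany" and seven pages, this reproduces the vector built with `paste0(base_url, pages)`.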
Now, visit the first page of results. Write a basic scraper that collects the links, dates and report titles from this page.
resultspage <- read_html(results_urls[1])
# Each result title is a link inside an <h6> heading
links <- html_nodes(resultspage, "h6 a") %>% html_attr("href")
titles <- html_nodes(resultspage, "h6 a") %>% html_text(trim = TRUE)
dates <- html_nodes(resultspage, "p:nth-child(4)") %>% html_text(trim = TRUE)
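Before combining the three vectors, it helps to check that each selector returns one element per report. The sketch below tries the same selectors on a small, made-up HTML snippet; the markup is an assumption that only mimics the structure the selectors rely on, not a copy of the IMF page.

```r
library(rvest)

# Made-up HTML: each result has a linked <h6> title and the date as 4th child
toy_html <- '<html><body>
  <div><h6><a href="/toy/report-1"> Toy Report 1 </a></h6>
       <p>Series</p><p>Author</p><p>Date: January 1, 2020</p></div>
  <div><h6><a href="/toy/report-2"> Toy Report 2 </a></h6>
       <p>Series</p><p>Author</p><p>Date: February 1, 2020</p></div>
</body></html>'

toy_page <- read_html(toy_html)
toy_links <- toy_page %>% html_nodes("h6 a") %>% html_attr("href")
toy_titles <- toy_page %>% html_nodes("h6 a") %>% html_text(trim = TRUE)
toy_dates <- toy_page %>% html_nodes("p:nth-child(4)") %>% html_text(trim = TRUE)

# All three vectors must have the same length to fit into one data frame
stopifnot(length(toy_links) == length(toy_titles),
          length(toy_links) == length(toy_dates))
```

If the lengths differ on the real page, one of the selectors is matching too many or too few elements and should be refined.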
Now, write a loop or apply command to go through the pages and collect the links and titles of all the reports.
You will have to make sure to get the 10 reports and their URLs per results page in a usable format. As always in R, there are multiple ways to do this. If you do not manage to, just save whatever result you have to an object so that you can return to it later.
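# Scrape links, titles and dates from one results page into a data frame
scrape_reports <- function(page) {
  parsed <- read_html(page)
  links <- html_nodes(parsed, "h6 a") %>% html_attr("href")
  titles <- html_nodes(parsed, "h6 a") %>% html_text(trim = TRUE)
  dates <- html_nodes(parsed, "p:nth-child(4)") %>% html_text(trim = TRUE)
  # Return the data frame explicitly instead of ending on an assignment
  data.frame(links, titles, dates, stringsAsFactors = FALSE)
}
# Apply the scraper to every results page and stack the resulting data frames
reports <- lapply(results_urls, scrape_reports)
reports_germany <- do.call(rbind, reports)
# Strip the "Date: " prefix from the date strings
reports_germany <- reports_germany %>%
  mutate(date = str_remove(dates, "Date: "))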
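To see what the combining step does without hitting the website, here is a minimal base R sketch on made-up data. The two toy data frames stand in for the per-page results, and `sub()` is used as a base R alternative to `str_remove()`.

```r
# Made-up per-page results (stand-ins for the output of the scraper)
page1 <- data.frame(titles = c("Report A", "Report B"),
                    dates = c("Date: July 10, 2019", "Date: June 3, 2019"),
                    stringsAsFactors = FALSE)
page2 <- data.frame(titles = "Report C",
                    dates = "Date: May 2, 2018",
                    stringsAsFactors = FALSE)

# do.call(rbind, ...) stacks a list of data frames with identical columns
all_pages <- do.call(rbind, list(page1, page2))

# Remove the leading "Date: " label; "^" anchors the match at the start
all_pages$date <- sub("^Date: ", "", all_pages$dates)
```

The result has one row per report across all pages, which is the format needed for the next steps.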
Visit one of the links.
Unfortunately, as you’ll see, there’s still another click separating us from the pdf file!
Try to find the download link for the pdf with SelectorGadget.
Sometimes there may not be a link because the pdf is not accessible. You can use error handling or if-conditions to account for this. For example, instead of assigning the link directly to the object in which you want to store the links, first check its length: if (length(link) == 1) { pdf_links[i] <- link }
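The two ideas, error handling and the length check, can be combined in one helper. This is a sketch under assumptions: `extract_pdf_link()` and its `get_links` argument are hypothetical names, and `get_links` stands for whatever code returns the candidate links for a report page.

```r
# Hypothetical helper: return the pdf link, or NA if none can be found
extract_pdf_link <- function(url, get_links) {
  # tryCatch() turns an error (e.g. a page that fails to load) into "no link"
  links <- tryCatch(get_links(url), error = function(e) character(0))
  # Accept the result only if exactly one link was found
  if (length(links) == 1) links else NA_character_
}

# Stand-in functions that imitate the three possible outcomes
one_link <- function(url) "/toy/file.pdf"
no_link <- function(url) character(0)
broken <- function(url) stop("page could not be read")

found <- extract_pdf_link("toy-url", one_link)
missing <- extract_pdf_link("toy-url", no_link)
failed <- extract_pdf_link("toy-url", broken)
```

Using `NA` rather than an empty string for missing links is a design choice: it makes the missing cases easy to find later with `is.na()`.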
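# Build absolute URLs for the individual report pages
reports_germany$url <- paste0("https://www.imf.org", reports_germany$links)
# Pre-allocate one (initially empty) pdf link per report
pdf_links <- character(length(reports_germany$url))
# seq_along() is safe even if there are no reports, unlike 1:length()
for (i in seq_along(reports_germany$url)) {
  page <- read_html(reports_germany$url[i])
  link <- page %>%
    html_nodes(".piwik_download") %>%
    html_attr("href")
  # Keep the link only if exactly one download link was found
  if (length(link) == 1) {
    pdf_links[i] <- link
  }
}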
Afterwards, look up the documentation for the download.file() command in R and use it to download the pdf. Try this out with a single file and have a look at it first, because you may need to adjust the settings depending on your operating system. (I can help you with that.)
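One setting that often matters is `mode`: on Windows, binary files such as pdfs should be downloaded with `mode = "wb"`, otherwise the file may be corrupted. The sketch below wraps `download.file()` so that a failed download reports `FALSE` instead of stopping the script. `download_pdf()` is a hypothetical helper, and the `file://` address is a deliberately non-existent placeholder used only to show the failure case.

```r
# Hypothetical helper: TRUE if the file was downloaded, FALSE otherwise
download_pdf <- function(url, destfile) {
  tryCatch({
    # mode = "wb" writes the file in binary mode (needed on Windows)
    download.file(url = url, destfile = destfile, mode = "wb", quiet = TRUE)
    file.exists(destfile)
  },
  warning = function(w) FALSE,
  error = function(e) FALSE)
}

# A source that does not exist should give FALSE rather than an error
ok <- download_pdf("file:///no/such/report.pdf", tempfile(fileext = ".pdf"))
```

After a successful download, open the file once by hand to confirm that it is a readable pdf before looping over all reports.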
Write a scraper to download the reports. If you have not managed to get the URLs of all the results pages in a usable format, just try the command with a single page.
Use the if-condition to download only when the URL exists.
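# Store the absolute pdf URL, or NA where no download link was found
reports_germany$pdf_links <- ifelse(pdf_links != "",
                                    paste0("https://www.imf.org", pdf_links),
                                    NA_character_)
dir.create("reports_germany", showWarnings = FALSE)
for (i in seq_along(reports_germany$pdf_links)) {
  # Download only if a pdf link was found for this report
  if (pdf_links[i] != "") {
    download.file(
      url = reports_germany$pdf_links[i],
      destfile = paste0("reports_germany/", basename(reports_germany$url[i]), ".pdf"),
      mode = "wb" # binary mode, needed on Windows for pdf files
    )
  }
}
save(reports_germany, file = "reports_germany.RData")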