Web Scraping with rvest
Learning Objectives
- Basics of Web Scraping.
- Chapter 24 of RDS.
- Overview of rvest.
- SelectorGadget.
- Web Scraping.
Data on the Web
There are at least 4 ways people download data on the web:
- Click to download a csv/xls/txt file.
- Use a package that interacts with an API.
- Use an API directly.
- Scrape directly from the HTML file.
In this lesson, we talk about how to do option 4.
Note: You shouldn’t download thousands of HTML files from a website to parse — the admins might block you if you send too many requests.
Note: Web scraping can be illegal in some circumstances, particularly if you intend to make money off of it or if you are collecting personal information (especially in Europe). I don’t give legal advice, so see Chapter 24 of RDS for some general recommendations, and talk to a lawyer if you are not sure.
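Related to that first note: if you do need to fetch several pages, a common courtesy is to pause between requests. Here is a minimal sketch (the URLs are placeholders, and read_html() comes from rvest, which we cover below):
urls <- c("https://example.com/page1", "https://example.com/page2")
pages <- list()
for (u in urls) {
  pages[[u]] <- read_html(u)  # download and parse one page
  Sys.sleep(5)                # pause a few seconds so we don't hammer the server
}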
Let’s load the tidyverse:
library(tidyverse)
HTML / CSS
We have to know a little bit about HTML and CSS in order to understand how to extract certain elements from a website.
HTML stands for “HyperText Markup Language”. Here is a small example:
<html>
  <head>
    <title>My First Web Page</title>
  </head>
  <body>
    <h1>Welcome!</h1>
    <p>This is a <b>simple</b> paragraph.</p>
    <a href="https://en.wikipedia.org/">Wikipedia</a>
  </body>
</html>
HTML consists of elements, which start with a tag inside <> (like <head> and <body>), optional attributes that format the element (like href=url), contents (the text), and an end tag (like </head> and </body>). The above HTML text would be formatted like this:
Welcome!
This is a simple paragraph.
Wikipedia
Important tags for web scraping:
- <h1>–<h6>: Heading tags, with <h1> as the highest (most important).
- <p>: Paragraph of text.
- <a>: Creates hyperlinks to other pages or resources.
- <img>: Embeds an image.
- <div>: Generic container for layout and styling.
- <span>: Inline container for styling parts of text.
- <ul>: Unordered list (bulleted).
- <ol>: Ordered list (numbered).
- <li>: List item, used inside <ul> or <ol>.
- <table>: Defines a table structure.
- <tr>: Table row.
- <td>: Table data cell.
- <th>: Table header cell.
- <strong>: Strong importance (usually bold).
- <em>: Emphasized text (usually italic).
CSS stands for “Cascading Style Sheets”. It’s a formatting language that indicates how HTML files should look. Every website you have been on is formatted with CSS.
Here is some example CSS:
h3 {
  color: red;
  font-style: italic;
}
footer div.alert {
  display: none;
}
The part before the curly braces is called a selector. It corresponds to HTML tags. Specifically, those two selectors would correspond to:
<h3>Some text</h3>
<footer>
  <div class="alert">More text</div>
</footer>
The code inside the curly braces consists of properties. For example, the h3 properties tell us to make the h3 headers red and in italics. The second CSS chunk says that all <div> tags of class "alert" in the <footer> should be hidden. CSS applies the same properties to the same selectors, so every time we use h3, the result is the same h3 styling of red and italicized text.
CSS selectors define patterns for selecting HTML elements. This is useful for scraping because we can extract all text in an HTML that corresponds to some CSS selector.
You can get a long way just selecting all p elements (standing for “paragraph”), since that is where a lot of text lives.
The most common attributes used are id and class.
- The selectors corresponding to class begin with a dot (.).
- The selectors corresponding to id begin with a hashtag (#).
The .a selector selects for “Text 1” in the following:
<p class="a">Text 1</p>
The .a selector selects for “Text 2” in the following:
<div class="a">Text 2</div>
The #b selector selects for “Text 3” in the following:
<p id="b">Text 3</p>
The #b selector selects for “Text 4” in the following:
<div id="b">Text 4</div>
More complicated selectors (from Richard Ressler):
- The name selector just uses the name value of the element, such as h3. All elements with the same name value will be selected.
- The id selector uses a #, e.g., #my_id, to select a single element with id=my_id (all ids are unique within a page).
- The class selector uses a ., e.g., .my_class, where class=my_class. All elements with the same class value will be selected.
- We can combine selectors with ".", ",", and/or white space to select a single element or groups of similar elements.
  - A selector of my_name.my_class combines name and class to select all (and only) elements with name=my_name and class=my_class.
  - The most important combinator is the white space, the descendant combinator. As an example, p a selects all <a> elements that are a child of (nested beneath) a <p> element in the tree. (See the sketch after this list.)
- You can also find elements based on the values of attributes, e.g., find an element based on an attribute containing specific text.
  - For a partial text search you would use '[attribute_name*="my_text"]'. Note the combination of single quotes and double quotes so you have double quotes around the value.
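To make the descendant combinator and the attribute selector concrete, here is a small sketch. It uses minimal_html() and html_elements() from rvest (introduced in the next section), and the tiny document and its links are made up for illustration:
library(rvest)
html_sel <- minimal_html('
  <p>Outside <a href="https://example.com/">link inside a p</a></p>
  <div><a href="https://example.com/page">link inside a div</a></div>
')
# Descendant combinator: only <a> elements nested beneath a <p>.
html_elements(html_sel, "p a") |> html_text2()
# Attribute selector with partial matching: href values containing "page".
html_elements(html_sel, '[href*="page"]') |> html_attr("href")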
rvest
We’ll use rvest to extract elements from HTML files.
library(rvest)
The typical pipeline for rvest is (a compact sketch follows this list):
- Load the HTML file into R using read_html().
- Choose the selectors based on SelectorGadget (see below) or by inspecting the selectors manually using developer tools (see below).
- Select those selectors using html_elements().
- Extract the text using html_text2().
  - Or, extract tables using html_table().
- Extreme cleaning using 412/612 tools.
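Here is that pipeline strung together as a compact sketch; the URL and selector are placeholders, not a real page:
page <- read_html("https://example.com")  # 1. load the HTML into R
page |>
  html_elements(".some-selector") |>      # 2-3. pick out elements matching a CSS selector
  html_text2()                            # 4. extract the text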
We’ll do a real example after we cover SelectorGadget and the web developer tools. But for now, let’s create a small html file:
html <- minimal_html('
  <p class="a">Text 1</p>
  <div class="a">Text 2</div>
  <p id="b">Text 3</p>
  <div id="b">Text 4</div>
')
We can get all p tag text via
html_elements(html, "p") |> html_text2()
[1] "Text 1" "Text 3"
We can get all div tag text via
html_elements(html, "div") |> html_text2()
[1] "Text 2" "Text 4"
We can get all class=a text via
html_elements(html, ".a") |> html_text2()
[1] "Text 1" "Text 2"
We can get all id=b text via
html_elements(html, "#b") |> html_text2()
[1] "Text 3" "Text 4"
Once you use html_elements(), it’s common to then use html_element() to extract even more information.
html_k <- minimal_html("
  <p><emph>A</emph>: <b>Ape</b> picks an <b>Apple</b> for <b>Aardvark</b> below.</p>
  <p><emph>L</emph>: <b>Lion</b> <b>Lifts</b> <b>Ladybug's</b> <b>Luggage</b></p>
  <p><emph>P</emph>: <b>Penguin</b> <b>Plays</b> with <b>Platypus</b> in the <b>Pool</b></p>
")
html_k |>
  html_elements("p") |>
  html_element("emph") |>
  html_text2()
[1] "A" "L" "P"
Exercise: Try extracting b with both html_element() and html_elements(). What’s the difference?
Exercise (from R4DS): Get all of the text from the li elements below:
html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
  </ul>
")
Exercise (from R4DS): Extract the name of each droid. Start with the output of the second exercise.
Exercise (from R4DS): Use the class attribute of weight to extract the weight of each droid. Do not use span. Start with the output of the second exercise.
SelectorGadget
SelectorGadget is a tool that lets you see what selector is associated with a particular element on a website.
To install SelectorGadget, drag this link to your bookmark bar on Chrome: SelectorGadget
Suppose we wanted to get the top 100 movies of all time from IMDB. The web page is very unstructured:
https://www.imdb.com/list/ls055592025/
If we click on the ranking of The Godfather, the “1” turns green (indicating what we have selected).
The “.text-primary” is the selector associated with the “1” we clicked on.
Everything highlighted in yellow also has the “.text-primary” selector associated with it.
We will also want the name of the movie. So if we click on that we get the selector associated with both the rank and the movie name: “a , .text-primary”.
But we also got a lot of stuff we don’t want (in yellow). If we click one of the yellow items that we don’t want, it turns red. This indicates that we don’t want to select it.
Only the ranking and the name remain, which are under the selector “.ipc-title-link-wrapper .ipc-title__text--reduced”.
It’s important to visually inspect the selected elements throughout the whole HTML file. SelectorGadget doesn’t always get all of what you want, or it sometimes gets too much.
Exercise: What selector can we use to get just the genres of each film, the metacritic score, and the IMDB rating?
Chrome developer tools
If you have trouble with SelectorGadget, you can also use the Chrome developer tools.
Chrome works best for web scraping (better than Safari/Edge/Firefox/etc). So install it if you don’t have it.
Open up the list of selectors with: ⋮ > More tools > Developer tools.
Clicking on the element selector on the top left of the developer tools will show you what selectors are possible with each element.
More rvest
Let’s do a more complicated example of rvest.
Use read_html() to save an HTML file to a variable. The variable will be an “xml_document” object.
html_obj <- read_html("https://www.imdb.com/list/ls055592025/")
html_obj
class(html_obj)
XML stands for “Extensible Markup Language”. It’s a markup language (like HTML and Markdown), useful for representing data.
rvest will store the HTML file as an XML document.
We can use html_elements() and the selectors we found in the previous section to get the elements we want. Insert the found selectors as the css argument.
ranking_elements <- html_elements(html_obj, css = ".ipc-title-link-wrapper .ipc-title__text--reduced")
head(ranking_elements)
{xml_nodeset (6)}
[1] <h3 class="ipc-title__text ipc-title__text--reduced">1. The Godfather</h3>
[2] <h3 class="ipc-title__text ipc-title__text--reduced">2. The Shawshank Red ...
[3] <h3 class="ipc-title__text ipc-title__text--reduced">3. Schindler's List< ...
[4] <h3 class="ipc-title__text ipc-title__text--reduced">4. Raging Bull</h3>
[5] <h3 class="ipc-title__text ipc-title__text--reduced">5. Casablanca</h3>
[6] <h3 class="ipc-title__text ipc-title__text--reduced">6. Citizen Kane</h3>
Note: html_element() is similar, but will return exactly one response per element, so it is useful if some elements have missing components.
To extract the text inside the obtained nodes, use html_text() or html_text2():
- html_text2() just does a little more pre-formatting (like converting HTML line breaks to R line breaks, removing extra white space, etc.), so you should typically use it.
ranking_text <- html_text2(ranking_elements)
head(ranking_text)
[1] "1. The Godfather" "2. The Shawshank Redemption" [3] "3. Schindler's List" "4. Raging Bull" [5] "5. Casablanca" "6. Citizen Kane"
After you do this, you need to tidy the data using your data munging tools.
movierank <- tibble(text = ranking_text) |>
  separate(col = "text", into = c("ranking", "movie"), sep = "\\.", extra = "merge")
Exercise: Extract the directors and the names of each film. Try to use SelectorGadget to find your own selectors. But I eventually ended up using this selector:
".kKeVOw , .ipc-title-link-wrapper .ipc-title__text--reduced"
A very simple example
Here is a very simple HTML file that is generated using rvest:
html <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")
The h1 selector selects for h1 tags.
html_elements(html, "h1") |> html_text()
[1] "This is a heading"
The .important selector selects for elements whose class attribute is important.
html_elements(html, ".important") |> html_text()
[1] "This is an important paragraph"
The #first selector selects for elements whose id attribute is first.
html_elements(html, "#first") |> html_text()
[1] "This is a paragraph"
Bigger example using rvest
- Let’s try to get the name, rank, year, and metascore for each movie.
html_elements(html_obj, ".dli-title-metadata-item:nth-child(1) , .hPmOUc, .ipc-title-link-wrapper .ipc-title__text--reduced") |>
  html_text2() |>
  tibble(text = _) |>
  mutate(
    title = str_detect(text, "\\d+\\."),       # title lines look like "1. The Godfather"
    metascore = str_detect(text, "Metascore"), # metascore lines contain the word "Metascore"
    year = !(title | metascore),               # the remaining selected lines are the year
    rank = cumsum(title),                      # a new movie starts at each title line
    type = case_when(
      title ~ "Title",
      year ~ "Year",
      metascore ~ "Metascore")) |>
  select(rank, text, type) |>
  pivot_wider(
    names_from = type,
    values_from = text) |>
  mutate(
    Title = str_remove(Title, "\\d+\\."),
    Metascore = str_remove(Metascore, "Metascore"),
    Title = str_squish(Title),
    Year = parse_integer(Year),
    Metascore = parse_integer(Metascore))
# A tibble: 100 × 4
rank Title Year Metascore
<int> <chr> <int> <int>
1 1 The Godfather 1972 100
2 2 The Shawshank Redemption 1994 82
3 3 Schindler's List 1993 95
4 4 Raging Bull 1980 90
5 5 Casablanca 1942 100
6 6 Citizen Kane 1941 100
7 7 Gone with the Wind 1939 97
8 8 The Wizard of Oz 1939 92
9 9 One Flew Over the Cuckoo's Nest 1975 84
10 10 Lawrence of Arabia 1962 100
# ℹ 90 more rows
html_table()
When data is in the form of a table, you can format it more easily with html_table().
The Wikipedia article on hurricanes: https://en.wikipedia.org/wiki/Atlantic_hurricane_season
This contains many tables which might be a pain to copy and paste into Excel (and we would be prone to error if we did so). Let’s try to automate this procedure.
Save the HTML:
wikixml <- read_html("https://en.wikipedia.org/wiki/Atlantic_hurricane_season")
We’ll extract all of the “table” elements.
wikidat <- html_elements(wikixml, "table")
Use html_table() to get a list of tables from the table elements:
tablist <- html_table(wikidat)
class(tablist)
[1] "list"
length(tablist)
[1] 20
tablist[[19]] |>
  select(1:4)
# A tibble: 11 × 4
   Year  Map     `Number of tropical cyclones` `Number of tropical storms`
   <chr> <chr>                           <int>                       <int>
 1 2010  ""                                 21                          19
 2 2011  ""                                 20                          19
 3 2012  ""                                 19                          19
 4 2013  ""                                 15                          14
 5 2014  ""                                  9                           8
 6 2015  ""                                 12                          11
 7 2016  ""                                 16                          15
 8 2017  ""                                 18                          17
 9 2018  ""                                 16                          15
10 2019  ""                                 18                          16
11 Total "Total"                           164                         153
You can clean up, bind, or merge these tables after you have read them in.
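For example, here is a minimal sketch of stacking two of these tables after reading them in (the indices 18 and 19 are arbitrary; converting every column to character first avoids type conflicts when binding):
tab_a <- tablist[[18]] |> mutate(across(everything(), as.character))
tab_b <- tablist[[19]] |> mutate(across(everything(), as.character))
bind_rows(tab_a, tab_b)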
Exercise: The Wikipedia page on the oldest mosques in the world has many tables.
https://en.wikipedia.org/wiki/List_of_the_oldest_mosques
- Use rvest to read these tables into R.
- Use rvest and SelectorGadget to extract the category for each table (mentioned in the Quran, in northeast Africa, etc.).
- Merge the data frames together. You only need to keep the building name, the country, and the time it was first built.
Hint: It’s easier if you use a CSS selector of "table.wikitable" to get the tables rather than just "table". I found this out by opening the developer tools in Chrome with CTRL + Shift + I and then playing around with the tables.
The first 15 rows should look like this:
# A tibble: 15 × 4
   Building                 Country      fb    category
   <chr>                    <chr>        <chr> <chr>
 1 Al-Haram Mosque          Saudi Arabia <NA>  Mentioned in the Quran
 2 Al-Aqsa Mosque           Palestine    <NA>  Mentioned in the Quran
 3 The Sacred Monument      Saudi Arabia <NA>  Mentioned in the Quran
 4 Quba Mosque              Saudi Arabia 622   Mentioned in the Quran
 5 Mosque of the Companions Eritrea      610   Northeast Africa
 6 Negash Āmedīn Mesgīd     Ethiopia     620   Northeast Africa
 7 Masjid al-Qiblatayn      Somalia      620   Northeast Africa
 8 Korijib Masjid           Djibouti     630   Northeast Africa
 9 Mosque of Amr ibn al-As  Egypt        641   Northeast Africa
10 Mosque of Ibn Tulun      Egypt        879   Northeast Africa
11 Al-Hakim Mosque          Egypt        928   Northeast Africa
12 Al-Azhar Mosque          Egypt        972   Northeast Africa
13 Arba'a Rukun Mosque      Somalia      1268  Northeast Africa
14 Fakr ad-Din Mosque       Somalia      1269  Northeast Africa
15 Great Mosque of Kairouan Tunisia      670   Northwest Africa