Web Scraping with rvest

Author

David Gerard

Published

June 23, 2025

Learning Objectives

Data on the Web

  • There are at least 4 ways people download data on the web:

    1. Click to download a csv/xls/txt file.
    2. Use a package that interacts with an API.
    3. Use an API directly.
    4. Scrape directly from the HTML file.
  • In this lesson, we cover option 4: scraping directly from the HTML.

  • Note: You shouldn’t download thousands of HTML files from a website to parse. The admins might block you if you send too many requests.

  • Note: Web scraping can be illegal in some circumstances, particularly if you intend to make money off of it or if you are collecting personal information (especially in Europe). I don’t give legal advice, so see Chapter 24 of R4DS for some general recommendations, and talk to a lawyer if you are not sure.
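
  • One way to follow the request-rate note above is to pause between downloads. The sketch below is only an illustration: read_pages_slowly() is a hypothetical helper (it is not part of rvest), the 2-second default delay is an arbitrary assumption, and the demonstration uses minimal_html() on small HTML strings as a stand-in for read_html() so that no real website is contacted.

```r
library(rvest)

# Hypothetical helper (not part of rvest): read pages one at a time,
# pausing `delay` seconds between requests.
read_pages_slowly <- function(urls, delay = 2, reader = read_html) {
  pages <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    pages[[i]] <- reader(urls[[i]])
    # No need to wait after the final request
    if (i < length(urls)) Sys.sleep(delay)
  }
  pages
}

# Offline demonstration: minimal_html() stands in for read_html(),
# and the "urls" are just small HTML strings.
demo_pages <- read_pages_slowly(
  c("<p>page one</p>", "<p>page two</p>"),
  delay = 0.1,
  reader = minimal_html
)
length(demo_pages)
```

  • With real URLs you would leave reader at its default and pick a delay that suits the site. Checking the site's terms of use first is still your responsibility.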

  • Let’s load the tidyverse:

    library(tidyverse)

HTML / CSS

  • We have to know a little bit about HTML and CSS in order to understand how to extract certain elements from a website.

  • HTML stands for “HyperText Markup Language”

    <html>
    <head>
      <title>My First Web Page</title>
    </head>
    <body>
      <h1>Welcome!</h1>
      <p>This is a <b>simple</b> paragraph.</p>
      <a href="https://en.wikipedia.org/">Wikipedia</a>
    </body>
    </html>
  • HTML consists of elements which start with a tag inside <> (like <head> and <body>), optional attributes that configure the element (like href="url"), contents (the text), and an end tag (like </head> and </body>). A browser would render the HTML above like this:


My First Web Page

Welcome!

This is a simple paragraph.

Wikipedia
  • Important tags for web scraping:

    • <h1> to <h6>: Heading tags, with <h1> as the highest (most important).
    • <p>: Paragraph of text.
    • <a>: Creates hyperlinks to other pages or resources.
    • <img>: Embeds an image.
    • <div>: Generic container for layout and styling.
    • <span>: Inline container for styling parts of text.
    • <ul>: Unordered list (bulleted).
    • <ol>: Ordered list (numbered).
    • <li>: List item, used inside <ul> or <ol>.
    • <table>: Defines a table structure.
    • <tr>: Table row.
    • <td>: Table data cell.
    • <th>: Table header cell.
    • <strong>: Strong importance (usually bold).
    • <em>: Emphasized text (usually italic).
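
  • The three parts of an element (tag name, attributes, and contents) can each be pulled out with rvest, the package used in the rest of these notes. This is a minimal sketch that re-creates part of the example page with minimal_html(); the variable names page and link are only for illustration.

```r
library(rvest)

# A small copy of the example page from above
page <- minimal_html('
  <h1>Welcome!</h1>
  <p>This is a <b>simple</b> paragraph.</p>
  <a href="https://en.wikipedia.org/">Wikipedia</a>
')

# Grab the <a> element
link <- html_element(page, "a")

html_name(link)          # the tag name: "a"
html_attr(link, "href")  # the value of the href attribute
html_text2(link)         # the contents: "Wikipedia"
```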
  • CSS stands for “Cascading Style Sheets”. It’s a formatting language that indicates how HTML files should look. Every website you have been on is formatted with CSS.

  • Here is some example CSS:

    h3 {
      color: red;
      font-style: italic;
    }
    
    footer div.alert {
      display: none;
    }
  • The part before the curly braces is called a selector. It specifies which HTML elements the rule applies to. The two selectors above would match the following HTML:

    <h3>Some text</h3>
    
    <footer>
    <div class="alert">More text</div>
    </footer>
  • The code inside the curly braces sets properties. For example, the h3 rule tells the browser to make the h3 headers red and italic. The second CSS chunk says that all <div> tags of class "alert" in the <footer> should be hidden.

  • CSS applies the same properties to every element that matches a selector. So every h3 header on the page gets the h3 styling of red, italicized text.

  • CSS selectors define patterns for selecting HTML elements. This is useful for scraping because we can extract all text in an HTML that corresponds to some CSS selector.

  • You can get a long way just selecting all p elements (standing for “paragraph”) since that is where a lot of text lives.

  • The most common attributes used are id and class.

    • The selectors corresponding to class begin with a dot (.).
    • The selectors corresponding to id begin with a hash (#).
  • The .a selector selects for “Text 1” in the following

    <p class="a">Text 1</p>
  • The .a selector selects for “Text 2” in the following

    <div class="a">Text 2</div>
  • The #b selector selects for “Text 3” in the following

    <p id="b">Text 3</p>
  • The #b selector selects for “Text 4” in the following

    <div id="b">Text 4</div>
  • More complicated selectors (from Richard Ressler):

    • The name selector just uses the name value of the element such as h3. All elements with the same name value will be selected.
    • The id selector uses a #, e.g., #my_id, to select a single element with id=my_id (ids are supposed to be unique within a page, though nothing enforces this).
    • The class selector uses a ., e.g., .my_class, where class=my_class. All elements with the same class value will be selected.
    • We can combine selectors with ".", " " (a space), and/or "\" to select a single element or groups of similar elements.
      • A selector of my_name.my_class combines name and class to select all (only) elements with the name=my_name and class=my_class.
    • The most important combinator is the white space, " ", the descendant combinator. As an example, p a selects all <a> elements that are descendants of (nested anywhere beneath) a <p> element in the tree.
    • You can also find elements based on the values of attributes, e.g., find an element based on an attribute containing specific text.
      • For a partial text search you would use '[attribute_name*="my_text"]'. Note the combination of single quotes and double quotes so you have double quotes around the value.
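
  • The combined selectors described above can be tried out on a toy page. This is a sketch under assumptions: the HTML, the example.com link, and the variable names are made up for illustration, and rvest is loaded here so that the block runs on its own.

```r
library(rvest)

# Made-up page with a mix of names, classes, ids, and links
page <- minimal_html('
  <p class="a">Paragraph with <a href="https://en.wikipedia.org/wiki/R">a wiki link</a></p>
  <div class="a">A div with <a href="https://example.com/">another link</a></div>
  <p id="b">Second paragraph</p>
')

# name.class: only <p> elements that also have class "a"
p_class_a <- html_elements(page, "p.a") |> html_text2()

# Descendant combinator: <a> elements nested anywhere inside a <p>
links_in_p <- html_elements(page, "p a") |> html_text2()

# Partial attribute match: any element whose href contains "wiki"
wiki_urls <- html_elements(page, '[href*="wiki"]') |> html_attr("href")

p_class_a
links_in_p
wiki_urls
```

  • Note that "p.a" (no space) and "p a" (with a space) mean different things: the first narrows <p> elements by class, the second looks for <a> elements inside <p> elements.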

rvest

  • We’ll use rvest to extract elements from HTML files.

    library(rvest)
  • The typical pipeline for rvest is:

    • Load the html file into R using read_html()
    • Choose the selectors based on SelectorGadget (see below) or by inspecting the selectors manually using developer tools (see below).
    • Select those selectors using html_elements().
    • Extract the text using html_text2().
      • Or, extract tables using html_table().
    • Extreme cleaning using 412/612 tools.
  • We’ll do a real example after we cover SelectorGadget and the web developer tools. But for now, let’s create a small html file:

    html <- minimal_html('
    <p class="a">Text 1</p>
    <div class="a">Text 2</div>
    <p id="b">Text 3</p>
    <div id="b">Text 4</div>
    ')
  • We can get all p tag text via

    html_elements(html, "p") |>
      html_text2()
    [1] "Text 1" "Text 3"
  • We can get all div tag text via

    html_elements(html, "div") |>
      html_text2()
    [1] "Text 2" "Text 4"
  • We can get all class=a text via

    html_elements(html, ".a") |>
      html_text2()
    [1] "Text 1" "Text 2"
  • We can get all id=b text via

    html_elements(html, "#b") |>
      html_text2()
    [1] "Text 3" "Text 4"
  • Once you use html_elements(), it’s common to then use html_element() to extract even more information.

    html_k <- minimal_html("
    <p><emph>A</emph>: <b>Ape</b> picks an <b>Apple</b> for <b>Aardvark</b> below.</p>
    <p><emph>L</emph>: <b>Lion</b> <b>Lifts</b> <b>Ladybug's</b> <b>Luggage</b></p>
    <p><emph>P</emph>: <b>Penguin</b> <b>Plays</b> with <b>Platypus</b> in the <b>Pool</b></p>
    ")
    html_k |>
      html_elements("p") |>
      html_element("emph") |>
      html_text2()
    [1] "A" "L" "P"
  • Exercise: Try extracting b with both html_element() and html_elements(). What’s the difference?

  • Exercise (from R4DS): Get all of the text from the li element below:

    html <- minimal_html("
      <ul>
        <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
        <li><b>R4-P17</b> is a <i>droid</i></li>
        <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
        <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
      </ul>
      ")
  • Exercise (from R4DS): Extract the name of each droid. Start with the output of the second exercise.

  • Exercise (from R4DS): Use the class attribute of weight to extract the weight of each droid. Do not use span. Start with the output of the second exercise.

SelectorGadget

  • SelectorGadget is a tool for you to see what selector influences a particular element on a website.

  • To install SelectorGadget, drag this link to your bookmark bar on Chrome: SelectorGadget

  • Suppose we wanted to get the top 100 movies of all time from IMDB. The web page is very unstructured:

    https://www.imdb.com/list/ls055592025/

     

  • If we click on the ranking of the Godfather, the “1” turns green (indicating what we have selected).

     

  • The “.text-primary” is the selector associated with the “1” we clicked on.

  • Everything highlighted in yellow also has the “.text-primary” selector associated with it.

  • We will also want the name of the movie. So if we click on that we get the selector associated with both the rank and the movie name: “a , .text-primary”.

     

  • But we also got a lot of stuff we don’t want (in yellow). If we click one of the yellow items that we don’t want, it turns red. This indicates that we don’t want to select it.

     

  • Only the ranking and the name remain, which are under the selector “.ipc-title-link-wrapper .ipc-title__text--reduced”.

  • It’s important to visually inspect the selected elements throughout the whole HTML file. SelectorGadget doesn’t always get all of what you want, or it sometimes gets too much.

  • Exercise: What selector can we use to get just the genres of each film, the metacritic score, and the IMDB rating?

Chrome developer tools

  • If you have trouble with SelectorGadget, you can also use the Chrome developer tools.

  • Chrome works best for web scraping (better than Safari/Edge/Firefox/etc). So install it if you don’t have it.

  • Open up the list of selectors with: ⋮ > More tools > Developer tools.

  • Clicking on the element selector on the top left of the developer tools will show you what selectors are possible with each element.

     

More rvest

  • Let’s do a more complicated example of rvest.

  • Use read_html() to save an HTML file to a variable. The variable will be an “xml_document” object

    html_obj <- read_html("https://www.imdb.com/list/ls055592025/")
    html_obj
    class(html_obj)
  • XML stands for “Extensible Markup Language”. It’s a markup language (like HTML and Markdown), useful for representing data. rvest will store the HTML file as an XML document.

  • We can use html_elements() and the selectors we found in the previous section to get the elements we want. Insert the found selectors as the css argument.

    ranking_elements <- html_elements(html_obj, css = ".ipc-title-link-wrapper .ipc-title__text--reduced")
    head(ranking_elements)
    {xml_nodeset (6)}
    [1] <h3 class="ipc-title__text ipc-title__text--reduced">1. The Godfather</h3>
    [2] <h3 class="ipc-title__text ipc-title__text--reduced">2. The Shawshank Red ...
    [3] <h3 class="ipc-title__text ipc-title__text--reduced">3. Schindler's List< ...
    [4] <h3 class="ipc-title__text ipc-title__text--reduced">4. Raging Bull</h3>
    [5] <h3 class="ipc-title__text ipc-title__text--reduced">5. Casablanca</h3>
    [6] <h3 class="ipc-title__text ipc-title__text--reduced">6. Citizen Kane</h3>
  • Note: html_element() is similar, but returns exactly one result per input element (the first match, or NA if nothing matches), so it is useful when some elements have missing components.
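
  • To see why the one-result-per-element behavior matters, here is a small sketch on a made-up list where one item lacks a price. The HTML and the names items, prices_all, and prices_each are assumptions for illustration only.

```r
library(rvest)

# Made-up list: the second item has no element of class "price"
page <- minimal_html('
  <ul>
    <li><b>Apples</b> cost <span class="price">$2</span></li>
    <li><b>Pears</b> are out of stock</li>
    <li><b>Plums</b> cost <span class="price">$3</span></li>
  </ul>
')

items <- html_elements(page, "li")

# html_elements() returns only the matches, so the result has length 2
# and no longer lines up with the three items
prices_all <- items |> html_elements(".price") |> html_text2()

# html_element() returns one result per <li>, with NA where nothing matched
prices_each <- items |> html_element(".price") |> html_text2()

length(prices_all)
prices_each
```

  • Because prices_each has one entry per item, it can be placed next to the item names in a data frame without misaligning rows.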

  • To extract the text inside the obtained nodes, use html_text() or html_text2():

    • html_text2() does a little more cleanup so that the text looks like it does in a browser (for example, converting <br> tags to "\n" and collapsing extra white space). So you should typically use this.
    ranking_text <- html_text2(ranking_elements)
    head(ranking_text)
    [1] "1. The Godfather"            "2. The Shawshank Redemption"
    [3] "3. Schindler's List"         "4. Raging Bull"             
    [5] "5. Casablanca"               "6. Citizen Kane"            
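
  • The difference between html_text() and html_text2() is easiest to see on a paragraph that contains a line break and extra spaces. This is a minimal sketch on made-up HTML; the names raw and clean are only for illustration.

```r
library(rvest)

# Made-up paragraph with repeated spaces and a <br> line break
page <- minimal_html("<p>First   line<br>Second line</p>")
p <- html_element(page, "p")

raw <- html_text(p)     # raw text: the <br> is ignored
clean <- html_text2(p)  # browser-like text: the <br> becomes a newline

writeLines(raw)
writeLines(clean)
```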
  • After you do this, you need to tidy the data using your data munging tools.

    # Split at the first period; "\\s*" also drops the space that follows it
    movierank <- tibble(text = ranking_text) |>
      separate(col = "text", into = c("ranking", "movie"), sep = "\\.\\s*",
               extra = "merge", convert = TRUE)
  • Exercise: Extract the directors and the names of each film. Try to use SelectorGadget to find your own selectors. But I eventually ended up using this selector: ".kKeVOw , .ipc-title-link-wrapper .ipc-title__text--reduced"

A very simple example

  • Here is a very simple html file that is generated using rvest:

    html <- minimal_html("
      <h1>This is a heading</h1>
      <p id='first'>This is a paragraph</p>
      <p class='important'>This is an important paragraph</p>
    ")
  • The h1 selector selects for h1 tags.

    html_elements(html, "h1") |>
      html_text()
    [1] "This is a heading"
  • The .important selector selects elements whose class attribute is important

    html_elements(html, ".important") |>
      html_text()
    [1] "This is an important paragraph"
  • The #first selector selects the element whose id attribute is first

    html_elements(html, "#first") |>
      html_text()
    [1] "This is a paragraph"

Bigger example using rvest

  • Let’s try to get the name, rank, year, and metascore for each movie.

    html_elements(html_obj, ".dli-title-metadata-item:nth-child(1) , .hPmOUc, .ipc-title-link-wrapper .ipc-title__text--reduced") |>
      html_text2() |>
      tibble(text = _) |>
      mutate(
        title = str_detect(text, "\\d+\\."),
        metascore = str_detect(text, "Metascore"),
        year = !(title | metascore),
        rank = cumsum(title),
        type = case_when(
          title ~ "Title",
          year ~ "Year",
          metascore ~ "Metascore")) |>
      select(rank, text, type) |>
      pivot_wider(
        names_from = type,
        values_from = text) |>
      mutate(
        Title = str_remove(Title, "\\d+\\."),
        Metascore = str_remove(Metascore, "Metascore"),
        Title = str_squish(Title),
        Year = parse_integer(Year),
        Metascore = parse_integer(Metascore))
    # A tibble: 100 × 4
        rank Title                            Year Metascore
       <int> <chr>                           <int>     <int>
     1     1 The Godfather                    1972       100
     2     2 The Shawshank Redemption         1994        82
     3     3 Schindler's List                 1993        95
     4     4 Raging Bull                      1980        90
     5     5 Casablanca                       1942       100
     6     6 Citizen Kane                     1941       100
     7     7 Gone with the Wind               1939        97
     8     8 The Wizard of Oz                 1939        92
     9     9 One Flew Over the Cuckoo's Nest  1975        84
    10    10 Lawrence of Arabia               1962       100
    # ℹ 90 more rows
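
  • The key trick in the pipeline above is that the scraped text arrives as one long vector (title, year, metascore, title, year, ...), and cumsum() over the "is this a title?" flag assigns every piece to its movie before pivot_wider() spreads the pieces into columns. Here is an offline sketch of just that step. The input vector is typed by hand, and the exact form of the raw strings (such as "100 Metascore") is an assumption rather than something taken from the live page.

```r
library(tibble)
library(dplyr)
library(stringr)
library(tidyr)

# Hand-typed stand-in for the scraped text (string format is assumed)
scraped <- c(
  "1. The Godfather", "1972", "100 Metascore",
  "2. The Shawshank Redemption", "1994", "82 Metascore"
)

movies <- tibble(text = scraped) |>
  mutate(
    is_title = str_detect(text, "^\\d+\\."),   # lines like "1. ..."
    is_meta = str_detect(text, "Metascore"),
    rank = cumsum(is_title),                   # increases by 1 at each title
    type = case_when(
      is_title ~ "Title",
      is_meta ~ "Metascore",
      TRUE ~ "Year"
    )
  ) |>
  select(rank, type, text) |>
  pivot_wider(names_from = type, values_from = text)

movies
```

  • This only works if every movie starts with a title line, since the title is what advances the rank counter.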

html_table()

  • When data is in the form of a table, you can format it more easily with html_table().

  • The Wikipedia article on hurricanes: https://en.wikipedia.org/wiki/Atlantic_hurricane_season

    This contains many tables which might be a pain to copy and paste into Excel (and we would be prone to error if we did so). Let’s try to automate this procedure.

  • Save the HTML

    wikixml <- read_html("https://en.wikipedia.org/wiki/Atlantic_hurricane_season")
  • We’ll extract all of the “table” elements.

    wikidat <- html_elements(wikixml, "table")
  • Use html_table() to get a list of tables from table elements:

    tablist <- html_table(wikidat)
    class(tablist)
    [1] "list"
    length(tablist)
    [1] 20
    tablist[[19]] |>
      select(1:4)
    # A tibble: 11 × 4
       Year  Map     `Number oftropical cyclones` `Number oftropical storms`
       <chr> <chr>                          <int>                      <int>
     1 2010  ""                                21                         19
     2 2011  ""                                20                         19
     3 2012  ""                                19                         19
     4 2013  ""                                15                         14
     5 2014  ""                                 9                          8
     6 2015  ""                                12                         11
     7 2016  ""                                16                         15
     8 2017  ""                                18                         17
     9 2018  ""                                16                         15
    10 2019  ""                                18                         16
    11 Total "Total"                          164                        153
  • You can clean up, bind, or merge these tables after you have read them in.
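
  • As a sketch of the "bind" step, the example below builds a toy page with two tables that share the same columns, reads them with html_table(), and stacks them with dplyr's bind_rows(). The numbers are made up and are not taken from the Wikipedia page; the column name table_id is also just an illustrative choice.

```r
library(rvest)
library(dplyr)

# Toy page with two tables (made-up values)
page <- minimal_html('
  <table>
    <tr><th>Year</th><th>Storms</th></tr>
    <tr><td>2001</td><td>10</td></tr>
    <tr><td>2002</td><td>12</td></tr>
  </table>
  <table>
    <tr><th>Year</th><th>Storms</th></tr>
    <tr><td>2003</td><td>9</td></tr>
  </table>
')

# One tibble per <table> element
tables <- page |>
  html_elements("table") |>
  html_table()

# Stack the tables and record which table each row came from
combined <- bind_rows(tables, .id = "table_id")
combined
```

  • Real tables scraped from the web rarely share column names and types this neatly, so expect to rename or convert columns before binding.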

  • Exercise: The Wikipedia page on the oldest mosques in the world has many tables.

    https://en.wikipedia.org/wiki/List_of_the_oldest_mosques

    1. Use rvest to read these tables into R.
    2. Use rvest and SelectorGadget to extract out the category for the table (mentioned in Quran, in northeast Africa, etc).
    3. Merge the data frames together. You only need to keep the building name, the country, and the time it was first built.

    Hint: It’s easier if you use a css selector of "table.wikitable" to get the table rather than just "table". I found this out by getting to the developer tools in Chrome with CTRL + Shift + I then playing around with the tables.

    The first 15 rows should look like this:

    # A tibble: 15 × 4
       Building                 Country      fb    category              
       <chr>                    <chr>        <chr> <chr>                 
     1 Al-Haram Mosque          Saudi Arabia <NA>  Mentioned in the Quran
     2 Al-Aqsa Mosque           Palestine    <NA>  Mentioned in the Quran
     3 The Sacred Monument      Saudi Arabia <NA>  Mentioned in the Quran
     4 Quba Mosque              Saudi Arabia 622   Mentioned in the Quran
     5 Mosque of the Companions Eritrea      610   Northeast Africa      
     6 Negash Āmedīn Mesgīd     Ethiopia     620   Northeast Africa      
     7 Masjid al-Qiblatayn      Somalia      620   Northeast Africa      
     8 Korijib Masjid           Djibouti     630   Northeast Africa      
     9 Mosque of Amr ibn al-As  Egypt        641   Northeast Africa      
    10 Mosque of Ibn Tulun      Egypt        879   Northeast Africa      
    11 Al-Hakim Mosque          Egypt        928   Northeast Africa      
    12 Al-Azhar Mosque          Egypt        972   Northeast Africa      
    13 Arba'a Rukun Mosque      Somalia      1268  Northeast Africa      
    14 Fakr ad-Din Mosque       Somalia      1269  Northeast Africa      
    15 Great Mosque of Kairouan Tunisia      670   Northwest Africa