Web Scraping with rvest

Author

David Gerard

Published

September 26, 2025

Learning Objectives

Data on the Web

  • There are at least 4 ways people download data on the web:

    1. Click to download a csv/xls/txt file.
    2. Use a package that interacts with an API.
    3. Use an API directly.
    4. Scrape directly from the HTML file.
  • In this lesson, we talk about how to do option 4.

  • Note: You shouldn’t download thousands of HTML files from a website to parse — the admins might block you if you send too many requests.

  • Note: Web scraping can be illegal in some circumstances, particularly if you intend to make money off of it or if you are collecting personal information (especially in Europe). I don’t give legal advice, so see Chapter 24 of RDS for some general recommendations, and talk to a lawyer if you are not sure.

  • Let’s load the tidyverse:

    library(tidyverse)

HTML / CSS

  • We have to know a little bit about HTML and CSS in order to understand how to extract certain elements from a website.

  • HTML stands for “HyperText Markup Language”.

    <html>
    <head>
      <title>My First Web Page</title>
    </head>
    <body>
      <h1>Welcome!</h1>
      <p>This is a <b>simple</b> paragraph.</p>
      <a href="https://en.wikipedia.org/">Wikipedia</a>
    </body>
    </html>
  • HTML consists of elements, which start with a tag inside <> (like <head> and <body>), may have optional attributes that modify the element (like href=url), contain contents (the text), and end with an end tag (like </head> and </body>). The above HTML text would be formatted like this:


My First Web Page

Welcome!

This is a simple paragraph.

Wikipedia
  • Important tags for web scraping:

    • <h1> through <h6>: Heading tags, with <h1> as the highest (most important).
    • <p>: Paragraph of text.
    • <a>: Creates hyperlinks to other pages or resources.
    • <img>: Embeds an image.
    • <div>: Generic container for layout and styling.
    • <span>: Inline container for styling parts of text.
    • <ul>: Unordered list (bulleted).
    • <ol>: Ordered list (numbered).
    • <li>: List item, used inside <ul> or <ol>.
    • <table>: Defines a table structure.
    • <tr>: Table row.
    • <td>: Table data cell.
    • <th>: Table header cell.
    • <strong>: Strong importance (usually bold).
    • <em>: Emphasized text (usually italic).
  • CSS stands for “Cascading Style Sheets”. It’s a formatting language that indicates how HTML files should look. Every website you have been on is formatted with CSS.

  • Here is some example CSS:

    h3 {
      color: red;
      font-style: italic;
    }
    
    footer div.alert {
      display: none;
    }
  • The part before the curly braces is called a selector. It corresponds to HTML tags. Specifically, those two selectors would correspond to HTML like:

    <h3>Some text</h3>
    
    <footer>
    <div class="alert">More text</div>
    </footer>
  • The code inside the curly braces specifies properties. For example, the h3 properties tell us to make the h3 headers red and italic. The second CSS chunk says that all <div> tags of class "alert" in the <footer> should be hidden.

  • CSS applies the same properties to every element matched by a selector. So every h3 on the page gets the h3 styling of red, italicized text.

  • CSS selectors define patterns for selecting HTML elements. This is useful for scraping because we can extract all text in an HTML that corresponds to some CSS selector.

  • You can get a long way just selecting all p elements (standing for “paragraph”) since that is where a lot of text lives.

  • The most common attributes used are id and class.

    • The selectors corresponding to class begin with a dot (.).
    • The selectors corresponding to id begin with a hash (#).
  • The .a selector selects for “Text 1” in the following

    <p class="a">Text 1</p>
  • The .a selector selects for “Text 2” in the following

    <div class="a">Text 2</div>
  • The #b selector selects for “Text 3” in the following

    <p id="b">Text 3</p>
  • The #b selector selects for “Text 4” in the following

    <div id="b">Text 4</div>
  • More complicated selectors (from Richard Ressler):

    • The name selector just uses the name value of the element such as h3. All elements with the same name value will be selected.
    • The id selector uses a #, e.g., #my_id, to select a single element with id=my_id (all ids are unique within a page).
    • The class selector uses a ., e.g., .my_class, where class=my_class. All elements with the same class value will be selected.
    • We can combine selectors, e.g., with a period (name plus class), a space (descendant), and/or a comma (grouping), to select a single element or groups of similar elements. (A short example appears after this list.)
      • A selector of my_name.my_class combines name and class to select all (and only) elements with name=my_name and class=my_class.
    • The most important combinator is the white space, the descendant combinator. As an example, p a selects all <a> elements that are nested beneath (descendants of) a <p> element in the tree.
    • You can also find elements based on the values of attributes, e.g., find an element based on an attribute containing specific text.
      • For a partial text search you would use '[attribute_name*="my_text"]'. Note the combination of single quotes and double quotes so you have double quotes around the value.
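  • For example, here is a small sketch of these combined selectors, using the rvest functions introduced in the next section (the HTML and the selectors here are made up for illustration):

    library(rvest)
    
    html_combo <- minimal_html('
      <div class="news"><p class="lead">Top story</p><p>Other story</p></div>
      <div class="ads"><p class="lead">Buy now!</p></div>
      <a href="https://en.wikipedia.org/">Wikipedia</a>
      <a href="https://www.imdb.com/">IMDB</a>
    ')
    
    ## name plus class: <p> elements that also have class "lead"
    html_elements(html_combo, "p.lead") |> html_text2()
    ## expect: "Top story" "Buy now!"
    
    ## descendant combinator: <p> elements nested beneath a <div> of class "news"
    html_elements(html_combo, "div.news p") |> html_text2()
    ## expect: "Top story" "Other story"
    
    ## partial attribute match: <a> elements whose href contains "wikipedia"
    html_elements(html_combo, 'a[href*="wikipedia"]') |> html_text2()
    ## expect: "Wikipedia"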

rvest

  • We’ll use rvest to extract elements from HTML files.

    library(rvest)
  • The typical pipeline for rvest is:

    • Load the html file into R using read_html()
    • Choose the selectors based on SelectorGadget (see below) or by inspecting the selectors manually using developer tools (see below).
    • Select those selectors using html_elements().
      • Possibly select elements within those elements via html_element()
      • E.g. html_elements() selects the observational units and html_element() selects values of variables within that unit.
    • Extract the text using html_text2().
      • Or, extract tables using html_table().
    • Extensive cleaning using your 412/612 data-wrangling tools.
  • We’ll do a real example after we cover SelectorGadget and the web developer tools. But for now, let’s create a small html file:

    html <- minimal_html('
    <p class="a">Text 1</p>
    <div class="a">Text 2</div>
    <p id="b">Text 3</p>
    <div id="b">Text 4</div>
    ')
  • We can get all p tag text via

    html_elements(html, "p") |>
      html_text2()
    [1] "Text 1" "Text 3"
  • We can get all div tag text via

    html_elements(html, "div") |>
      html_text2()
    [1] "Text 2" "Text 4"
  • We can get all class=a text via

    html_elements(html, ".a") |>
      html_text2()
    [1] "Text 1" "Text 2"
  • We can get all id=b text via

    html_elements(html, "#b") |>
      html_text2()
    [1] "Text 3" "Text 4"
  • Once you use html_elements(), it’s common to then use html_element() to extract even more information.

    html_k <- minimal_html("
    <p><emph>A</emph>: <b>Ape</b> picks an <b>Apple</b> for <b>Aardvark</b> below.</p>
    <p><emph>L</emph>: <b>Lion</b> <b>Lifts</b> <b>Ladybug's</b> <b>Luggage</b></p>
    <p><emph>P</emph>: <b>Penguin</b> <b>Plays</b> with <b>Platypus</b> in the <b>Pool</b></p>
    ")
    html_k |>
      html_elements("p") |>
      html_element("emph") |>
      html_text2()
    [1] "A" "L" "P"
  • If you want all of the <b> elements that are within a <p>, you can use the descendant combinator p b:

    html_k |>
      html_elements("p b") |>
      html_text()
     [1] "Ape"       "Apple"     "Aardvark"  "Lion"      "Lifts"     "Ladybug's"
     [7] "Luggage"   "Penguin"   "Plays"     "Platypus"  "Pool"     
  • Exercise: Try extracting b with both html_element() and html_elements(). What’s the difference?

  • Exercise (from R4DS): Get all of the text from the li element below:

    html <- minimal_html("
      <ul>
        <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
        <li><b>R4-P17</b> is a <i>droid</i></li>
        <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
        <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
      </ul>
      ")
  • Exercise (from R4DS): Extract the name of each droid. Start with the output of the second exercise.

  • Exercise (from R4DS): Use the class attribute of weight to extract the weight of each droid. Do not use span. Start with the output of the second exercise.

SelectorGadget

  • SelectorGadget is a tool for you to see what selector influences a particular element on a website.

  • To install SelectorGadget, drag this link to your bookmark bar on Chrome: SelectorGadget

  • Suppose we wanted to get the top 100 movies of all time from IMDB. The web page is very unstructured:

    https://www.imdb.com/list/ls055592025/


  • If the above link fails, try: https://data-science-master.github.io/lectures/08_web_scraping/imdb_100.html

  • If we click on the ranking of the Godfather, the “1” turns green (indicating what we have selected).


  • The “.text-primary” is the selector associated with the “1” we clicked on.

  • Everything highlighted in yellow also has the “.text-primary” selector associated with it.

  • We will also want the name of the movie. So if we click on that we get the selector associated with both the rank and the movie name: “a , .text-primary”.


  • But we also got a lot of stuff we don’t want (in yellow). If we click one of the yellow items that we don’t want, it turns red. This indicates that we don’t want to select it.


  • Only the ranking and the name remain, which are under the selector “.ipc-title-link-wrapper .ipc-title__text--reduced”.

  • It’s important to visually inspect the selected elements throughout the whole HTML file. SelectorGadget doesn’t always get all of what you want, or it sometimes gets too much.

What selector can we use to get just the names of each film, the metacritic score, and the IMDB rating?

Here is what I got:

".ipc-rating-star--rating , .metacritic-score-label, .ipc-title-link-wrapper .ipc-title__text--reduced"

Chrome developer tools

  • If you have trouble with SelectorGadget, you can also use the Chrome developer tools.

  • Chrome works best for web scraping (better than Safari/Edge/Firefox/etc). So install it if you don’t have it.

  • Open up the list of all selectors with: ⋮ > More tools > Developer tools.

  • Clicking on the element selector on the top left of the developer tools will show you what selectors are possible with each element.


  • You can also right click on the part of the website you are interested in and then click “Inspect”.

  • In the developer tools, hover over the element you are interested in, right click, and then click Copy > Copy selector. This gives you the selector for that element, which you can then inspect.

More rvest

  • Let’s do a more complicated example of rvest.

  • Use read_html() to save an HTML file to a variable. The variable will be an “xml_document” object.

    html_obj <- read_html("https://www.imdb.com/list/ls055592025/")
    html_obj
    class(html_obj)
  • Try read_html_live() if you notice read_html() is not working (e.g., the page is rendered with JavaScript, so read_html() does not return the content you see in the browser).

  • XML stands for “Extensible Markup Language”. It’s a markup language (like HTML and Markdown), useful for representing a document. rvest will store the HTML file as an XML document.

  • We can use html_elements() and the selectors we found in the previous section to get the elements we want. Insert the found selectors as the css argument.

    ranking_elements <- html_elements(html_obj, css = ".ipc-title-link-wrapper .ipc-title__text--reduced")
    head(ranking_elements)
    {xml_nodeset (6)}
    [1] <h3 class="ipc-title__text ipc-title__text--reduced">1. The Godfather</h3>
    [2] <h3 class="ipc-title__text ipc-title__text--reduced">2. The Shawshank Red ...
    [3] <h3 class="ipc-title__text ipc-title__text--reduced">3. Schindler's List< ...
    [4] <h3 class="ipc-title__text ipc-title__text--reduced">4. Raging Bull</h3>
    [5] <h3 class="ipc-title__text ipc-title__text--reduced">5. Casablanca</h3>
    [6] <h3 class="ipc-title__text ipc-title__text--reduced">6. Citizen Kane</h3>
  • Note: html_element() is similar, but will return exactly one response per element, so is useful if some elements have missing components.
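
  • Here is a toy sketch of that difference (the html below is made up): when a component is missing, html_element() returns an NA for it, while html_elements() silently drops it, so the results no longer line up with the observational units.

    html_miss <- minimal_html("
      <li><span>has a span</span></li>
      <li>no span here</li>
    ")
    html_miss |>
      html_elements("li") |>
      html_element("span") |>
      html_text2()
    ## expect: "has a span" NA
    html_miss |>
      html_elements("li") |>
      html_elements("span") |>
      html_text2()
    ## expect: "has a span"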

  • To extract the text inside the obtained nodes, use html_text() or html_text2():

    • html_text2() just does a little more pre-processing (like converting HTML line breaks to newlines, collapsing extra white space, etc.), so you should typically use it.
    ranking_text <- html_text2(ranking_elements)
    head(ranking_text)
    [1] "1. The Godfather"            "2. The Shawshank Redemption"
    [3] "3. Schindler's List"         "4. Raging Bull"             
    [5] "5. Casablanca"               "6. Citizen Kane"            
  • After you do this, you need to tidy the data using your data munging tools.

    tibble(text = ranking_text) |>
      separate(col = "text", into = c("ranking", "movie"), sep = "\\.", extra = "merge") ->
    movierank
    movierank
    # A tibble: 100 × 2
       ranking movie                             
       <chr>   <chr>                             
     1 1       " The Godfather"                  
     2 2       " The Shawshank Redemption"       
     3 3       " Schindler's List"               
     4 4       " Raging Bull"                    
     5 5       " Casablanca"                     
     6 6       " Citizen Kane"                   
     7 7       " Gone with the Wind"             
     8 8       " The Wizard of Oz"               
     9 9       " One Flew Over the Cuckoo's Nest"
    10 10      " Lawrence of Arabia"             
    # ℹ 90 more rows

Extract the directors and the names of each film. Try to use SelectorGadget to find your own selectors.

There are probably multiple ways to do this. But I used ".dli-parent" to get the movies. Then I did two separate calls to html_element() with ".ipc-title__text--reduced" to get the titles and ".bDNbpf span" to get the directors.

html_elements(html_obj, ".dli-parent") |>
  html_element(".ipc-title__text--reduced") |>
  html_text2() ->
  titvec

html_elements(html_obj, ".dli-parent")  |>
  html_element(".bDNbpf span") |>
  html_text2() ->
  dirvec

tibble(title = titvec, dir = dirvec) |>
  separate(col = "title", into = c("rank", "title"), sep = "\\.") |>
  mutate(dir = str_extract(string = dir, pattern = "Director.+Stars")) |>
  mutate(dir = str_remove(dir, "^Directors*")) |>
  mutate(dir = str_remove(dir, "Stars*$"))
# A tibble: 100 × 3
   rank  title                              dir                                 
   <chr> <chr>                              <chr>                               
 1 1     " The Godfather"                   Francis Ford Coppola                
 2 2     " The Shawshank Redemption"        Frank Darabont                      
 3 3     " Schindler's List"                Steven Spielberg                    
 4 4     " Raging Bull"                     Martin Scorsese                     
 5 5     " Casablanca"                      Michael Curtiz                      
 6 6     " Citizen Kane"                    Orson Welles                        
 7 7     " Gone with the Wind"              Victor Fleming                      
 8 8     " The Wizard of Oz"                Victor FlemingGeorge CukorNorman Ta…
 9 9     " One Flew Over the Cuckoo's Nest" Milos Forman                        
10 10    " Lawrence of Arabia"              David Lean                          
# ℹ 90 more rows

A very simple example

  • Here is a very simple html file that is generated using rvest:

    html <- minimal_html("
      <h1>This is a heading</h1>
      <p id='first'>This is a paragraph</p>
      <p class='important'>This is an important paragraph</p>
    ")
  • The h1 selector selects for h1 tags.

    html_elements(html, "h1") |>
      html_text()
    [1] "This is a heading"
  • The .important selector selects elements whose class attribute is important.

    html_elements(html, ".important") |>
      html_text()
    [1] "This is an important paragraph"
  • The #first selector selects elements whose id attribute is first.

    html_elements(html, "#first") |>
      html_text()
    [1] "This is a paragraph"

Bigger example using rvest

  • You typically use html_elements() and html_element() together. You first use html_elements() to select observations. You then use html_element() to select values of variables from each observation.

  • Let’s try and get the name, rank, year, and metascore for each movie.

  • I played with the developer tools until I saw that

    • “.ipc-metadata-list-summary-item” extracts each movie
    • “.ipc-title” extracts the title from a movie
    • “.metacritic-score-box” extracts the metascore from a movie
    • “.dli-title-metadata-item” extracts the year, runtime, and rating for each movie
movie_list <- html_elements(html_obj, ".ipc-metadata-list-summary-item") 
length(movie_list) ## should be 100
[1] 100
tibble(
  title = movie_list |>
    html_element(".ipc-title") |>
    html_text2(),
  meta = movie_list |>
    html_element(".metacritic-score-box") |>
    html_text2(),
  year = movie_list |>
    html_element(".dli-title-metadata-item") |>
    html_text2()
)
# A tibble: 100 × 3
   title                              meta  year 
   <chr>                              <chr> <chr>
 1 1. The Godfather                   100   1972 
 2 2. The Shawshank Redemption        82    1994 
 3 3. Schindler's List                95    1993 
 4 4. Raging Bull                     90    1980 
 5 5. Casablanca                      100   1942 
 6 6. Citizen Kane                    100   1941 
 7 7. Gone with the Wind              97    1939 
 8 8. The Wizard of Oz                92    1939 
 9 9. One Flew Over the Cuckoo's Nest 84    1975 
10 10. Lawrence of Arabia             100   1962 
# ℹ 90 more rows

We could of course clean the title column here into rank and title.
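
For example, if we had saved the tibble above as movie_df (a made-up name), we could reuse the separate() call from earlier:

movie_df |>
  separate(col = "title", into = c("rank", "title"), sep = "\\.", extra = "merge")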

If we wanted the runtime and rating, we could loop over the movie list that we created and extract each of the three elements that match “.dli-title-metadata-item”:

year_vec <- rep(NA, length = length(movie_list))
runtime_vec <- rep(NA, length = length(movie_list))
rating_vec <- rep(NA, length = length(movie_list))
for (i in seq_along(movie_list)) {
  movie_list[i] |>
    html_elements(".dli-title-metadata-item") |>
    html_text2() ->
    x
  year_vec[[i]] <- x[[1]]
  runtime_vec[[i]] <- x[[2]]
  rating_vec[[i]] <- x[[3]]
}
head(year_vec)
[1] "1972" "1994" "1993" "1980" "1942" "1941"
head(runtime_vec)
[1] "2h 55m" "2h 22m" "3h 15m" "2h 9m"  "1h 42m" "1h 59m"
head(rating_vec)
[1] "R"  "R"  "R"  "R"  "PG" "PG"

html_table()

  • When data is in the form of a table, you can format it more easily with html_table().

  • The Wikipedia article on hurricanes in 2024: https://en.wikipedia.org/wiki/2024_Atlantic_hurricane_season

  • If the above link fails, try: https://data-science-master.github.io/lectures/08_web_scraping/wiki_2.html

    This contains many tables which might be a pain to copy and paste into Excel (and we would be prone to error if we did so). Let’s try to automate this procedure.

  • Save the HTML

    wikixml <- read_html("https://en.wikipedia.org/wiki/2024_Atlantic_hurricane_season")
  • We’ll extract all of the “table” elements.

    wikidat <- html_elements(wikixml, "table")
  • Use html_table() to get a list of tables from table elements:

    tablist <- html_table(wikidat)
    class(tablist)
    [1] "list"
    length(tablist)
    [1] 27
    tablist[[3]]
    # A tibble: 10 × 3
        Rank Cost               Season
       <int> <chr>               <int>
     1     1 ≥ $294.803 billion   2017
     2     2 $172.297 billion     2005
     3     3 $130.438 billion     2024
     4     4 $117.708 billion     2022
     5     5 ≥ $80.827 billion    2021
     6     6 $72.341 billion      2012
     7     7 $61.148 billion      2004
     8     8 $54.336 billion      2020
     9     9 ≥ $50.526 billion    2018
    10    10 ≥ $48.855 billion    2008
  • You can clean up, bind, or merge these tables after you have read them in.
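
  • For example, here is a sketch of cleaning the costliest-seasons table (assuming it is still the third table in tablist), using parse_number() to strip the “≥”, “$”, and “billion”:

    tablist[[3]] |>
      mutate(cost_billions = parse_number(Cost)) |>
      select(Rank, Season, cost_billions)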

The Wikipedia page on the oldest mosques in the world has many tables: https://en.wikipedia.org/wiki/List_of_the_oldest_mosques

If the above link fails, try: https://data-science-master.github.io/lectures/08_web_scraping/mosque_2.html

  1. Use rvest to read these tables into R.
  2. Merge the data frames together. You only need to keep the building name, the country, and the time it was first built.

It’s easier if you use a CSS selector of "table.wikitable" with html_elements() to get just the tables of interest, rather than just "table". I found this out by opening the developer tools in Chrome with CTRL + Shift + I and playing around with the tables.

mosque <- read_html("https://data-science-master.github.io/lectures/05_web_scraping/mosque.html")
mosque |>
  html_elements("h3") |>
  html_text2() ->
  catvec
catvec <- c("Mentioned in Quran", catvec)

mosque |>
  html_elements("table.wikitable") |>
  html_table() ->
  tablist
## Errors if you try bind_rows() because some are integers and some are characters
for (i in seq_along(tablist)) {
  if (any(names(tablist[[i]]) == "First built")) {
    names(tablist[[i]])[names(tablist[[i]]) == "First built"] <- "date"
    tablist[[i]]$date <- str_remove_all(tablist[[i]]$date, "\\[.*\\]")
  }
}
tb <- bind_rows(tablist)
tb |>
  select(Building, Location, Country, date, Notes, Tradition)
# A tibble: 176 × 6
   Building                               Location Country date  Notes Tradition
   <chr>                                  <chr>    <chr>   <chr> <chr> <chr>    
 1 Al-Haram Mosque                        Mecca    Saudi … Unkn… "Al-…  <NA>    
 2 Haram al-Sharif, also known as the Al… Jerusal… Palest… Cons… "Al-…  <NA>    
 3 The Sacred Monument                    Muzdali… Saudi … Unkn… "Al-…  <NA>    
 4 Quba Mosque                            Medina   Saudi … 622   "The…  <NA>    
 5 Mosque of the Companions               Massawa  Eritrea 620s… "Bel… ""       
 6 Al Nejashi Mosque                      Negash   Ethiop… 7th … "By … ""       
 7 Mosque of Amr ibn al-As                Cairo    Egypt   641   "Nam… ""       
 8 Mosque of Ibn Tulun                    Cairo    Egypt   879   ""    ""       
 9 Al-Azhar Mosque                        Cairo    Egypt   972   ""    "Sunni"  
10 Arba'a Rukun Mosque                    Mogadis… Somalia 1268… ""    "Sunni"  
# ℹ 166 more rows