There are at least 4 ways people download data on the web:
We know how to do 1.
This lesson is about doing 2 and 3.
Let’s load the tidyverse:
library(tidyverse)
API: Application Programming Interface
Many of the most popular databases/websites have an R package that wraps the API through R functions.
If an R package exists for an API, you should use it.
Examples:
tidycensus: U.S. Census data API.
gh: GitHub API.
wbstats: World Bank data.
rebird: Ebird dataset for bird sitings.
geonames: Dataset containing unique names for geographic locations.
WikipediR: Wikipedia API:
Each package will be entirely unique, and you have to read its documentation to know how to download data. So we won’t cover any of them.
The goal of this section is not to provide a comprehensive lesson on HTTP and extracting data using API’s. Rather, this is just to point you to some resources and give you some examples.
Each API is different, and so you’ll always have to figure out how to interact with a new API.
API’s always have documentation on what parameters you can send, how you can send them, and what data you would get back based on the values of those parameters.
E.g.: OMDB (an open-source version of IMDB) has a simple API http://www.omdbapi.com/
Some API’s are public and you just have at them (though, be congnizent that you are not making too many requests).
Others require authentication to access the API.
There are three ways that apps authenticate, from easiest to hardest:
“Basic” authentication, where you just provide a username and password. The basic syntax for this is:
req_auth_basic(req, username, password)
You should already have a username and password set up.
I would suggest keeping your password secure using your .Renviron
file or via {keyring}
(see below).
An API key is a string that you add to your request.
To obtain a free key from OMDB and access it in R:
Sign up for a free key: https://www.omdbapi.com/apikey.aspx
Open up your .Renviron file using the usethis package.
library(usethis)
edit_r_environ()
Add the key OMDB sent you by email to the .Renviron package. You
can call it OMDB_API_KEY
, for example. In which case you
would write the following in .Renviron:
OMDB_API_KEY = <your-private-key>
Where “<your-private-key>
” is the private key OMDB
sent you by email.
Restart R.
You can now always access your private key by
movie_key <- Sys.getenv("OMDB_API_KEY")
You typically put your API key as a query parameter via
req_url_query()
(see below)
It is important to never save or display your private key in a
file you could share. You are responsible for all behavior associated
with your key. That is why we saved it to our .Renviron and are only
accessing it secretly through Sys.getenv()
.
It is still not great that your key is in a plain text environment. You can add a layer of security by using the keyring package: https://github.com/r-lib/keyring
library(keyring)
key_set("OMDB_API_KEY_SECURE") ## do this once to save the key
movie_key_2 <- key_get("OMDB_API_KEY_SECURE") ## do this each time you need the key
A person with access to your computer who knows R and the keyring package could still get to your key. But it is more secure than placing your key in a plain text file (which is what .Renviron is). There are more secure ways to access keys in R.
See OAuth from
{httr2}
for details.
OAuth (“Open Authorization”) is an open standard for authorization that allows users to securely access resources without giving away their login credentials.
The idea is that your software asks the user if it can use the user’s authorization to access the API.
As an example, we will consider the GitHub API.
“Register an application” with the API provider before you can access the API.
You do this by creating a developer account on the API’s website, then registering a new OAuth app.
You won’t actually have an app, but API developers use this word for any means where you ask to use a user’s authorization.
This typically involves providing a name and a callback URL (typically http://localhost:1410) for your “application”.
For GitHub:
Register at https://github.com/settings/developers
Use any URL (like https://github.com/) as your homepage URL
Use http://localhost:1410 as the callback url.
The provider will then give you a client_id
and a
client_secret
that you will need to use.
Neither of these need to be protected like a password since the user will provide their own password/username for authentication.
For GitHub:
You can find these here: https://github.com/settings/developers
Click on your OAuth app.
Obtain a “token URL”, which is sometimes called an “access URL”.
This can be found somewhere in the API docs, but you have to read them carefully.
For GitHub:
Use oauth_client()
to create a client. You feed in
the client_id
, the client_secret
, the
toeken_url
, and any name you choose into it.
client <- oauth_client(
id = "client_id",
secret = "client_secret",
token_url = "token_url",
name = "personal_app_name"
)
For GitHub:
client <- oauth_client(
id = "933ffc6f53e466c58aa1",
secret = "aa02ef46f93aa51a360f23f30f7640b445118e7f",
token_url = "https://github.com/login/oauth/access_token",
name = "gitapp"
)
You get an “authorization URL”.
The autorization URL is: https://github.com/login/oauth/authorize
auth_url <- "https://github.com/login/oauth/authorize"
Feed the authorization URL and the client into
req_oauth_auth_code()
during a request.
request("https://api.github.com/user") |>
req_oauth_auth_code(client = client, auth_url = auth_url) |>
req_perform() ->
gout
gout
There are sometimes other “flows” for OAuth that require different steps. See here for details: https://httr2.r-lib.org/articles/oauth.html
When using OAuth in a package, folks often (i) “cache” the token securely (because the generated token should be kept private), and (ii) ask folks to generate their own app.
Caching is easy. Just set cache_disk = TRUE
in
req_oauth_auth_code()
.
When you ask folks to generate their own app, then you should
have client_id
and client_secret
as arguments
that the user can provide.
Most API’s on the web use HTTP (Hyper-Text Transfer Protocol). It’s a language for querying and obtaining data.
We won’t learn HTTP, but we will use R’s interface for HTTP
through the {httr2}
package.
library(httr2)
To request data from a API, you typically take these steps:
request()
.req_url_path_append()
.req_url_query()
.req_headers()
.req_oauth_auth_code()
req_method()
GET
will fetch data.
PUT
will add stuff to the target.DELETE
will delete stuff at the target.req_dry_run()
and check that it is correct.req_perform()
.Some API’s only modify the URL path, some only modify the query, some require headers. Let’s go through each of these in turn.
Every API has a base URL that you modify.
Some API’s only modify the URL path to obtain the endpoint.
Consider the Wizard World API
The documentation says that the base URL is “https://wizard-world-api.herokuapp.com”.
baseurl <- "https://wizard-world-api.herokuapp.com"
Let’s start a request with this baseurl via
request()
.
wizreq <- request(baseurl)
The documentation just says that we modify this URL to obtain the different objects. - E.g., to obtain a list of all elixirs that occur in Harry Potter, we just add “Elixirs” at the end.
req_url_path_append()
wizreq <- req_url_path_append(wizreq, "Elixirs")
wizreq
## <httr2_request>
## GET https://wizard-world-api.herokuapp.com/Elixirs
## Body: empty
Let’s look at the http request
req_dry_run(wizreq)
## GET /Elixirs HTTP/1.1
## Host: wizard-world-api.herokuapp.com
## User-Agent: httr2/0.2.3 r-curl/5.1.0 libcurl/7.81.0
## Accept: */*
## Accept-Encoding: deflate, gzip, br, zstd
We then implement this request via
req_perform()
.
eout <- req_perform(wizreq)
eout
## <httr2_response>
## GET https://wizard-world-api.herokuapp.com/Elixirs
## Status: 200 OK
## Content-Type: application/json
## Body: In memory (62284 bytes)
We would then clean this output (see rectangling below). But as a preview, we would do
tibble(elixir = resp_body_json(resp = eout)) |>
unnest_wider(col = elixir)
## # A tibble: 145 × 10
## id name effect sideEffects characteristics time difficulty ingredients
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <list>
## 1 0106fb… Ferg… Treat… Potential … <NA> <NA> Unknown <list [3]>
## 2 021b40… Mane… Rapid… <NA> <NA> <NA> Unknown <NULL>
## 3 024f56… Poti… <NA> <NA> <NA> <NA> Unknown <NULL>
## 4 06beea… Rudi… Helps… <NA> <NA> <NA> Unknown <list [2]>
## 5 078b53… Lung… Most … <NA> <NA> <NA> Unknown <NULL>
## 6 07944d… Esse… Menta… <NA> Green in colour <NA> Advanced <list [2]>
## 7 0dd8d2… Anti… Cures… <NA> Green in colour <NA> Moderate <list [4]>
## 8 0e2240… Rest… Rever… <NA> Purple coloured <NA> Unknown <NULL>
## 9 0e7228… Skel… Resto… <NA> Smokes when po… <NA> Moderate <list [6]>
## 10 0e7f89… Chee… <NA> <NA> Yellow in colo… <NA> Advanced <list [1]>
## # ℹ 135 more rows
## # ℹ 2 more variables: inventors <list>, manufacturer <chr>
Some APIs require you to modify the URL via queries.
Queries occur after a question mark and are of the form
http://www.url.com/?query1=arg1&query2=ar2&query3=arg3
The OMDB API requires you to provide queries. The documentation says
Send all data requests to: http://www.omdbapi.com/?apikey=[yourkey]&
But that documentation already has a query as a part of it (apikey=[yourkey]).
You add queries via req_url_query()
.
The documentation for OMDB has a table called “Parameters”, where they list the possible queries.
Let’s fetch information on the film The Lighthouse, obtaining a short plot in json format.
This is what the URL looks like without my API key:
movie_key <- Sys.getenv("OMDB_API_KEY")
request("http://www.omdbapi.com/") |>
req_url_query(t = "The Lighthouse",
plot = "short",
r = "json") ->
movie_req
movie_req
## <httr2_request>
## GET http://www.omdbapi.com/?t=The%20Lighthouse&plot=short&r=json
## Body: empty
Let’s add our API key and perform the request.
movie_req |>
req_url_query(apikey = movie_key) |>
req_perform() ->
mout
mout
## <httr2_response>
## GET
## http://www.omdbapi.com/?t=The%20Lighthouse&plot=short&r=json&apikey=18375155
## Status: 200 OK
## Content-Type: application/json
## Body: In memory (994 bytes)
Output is just a list:
resp_body_json(mout) |>
str()
## List of 25
## $ Title : chr "The Lighthouse"
## $ Year : chr "2019"
## $ Rated : chr "R"
## $ Released : chr "01 Nov 2019"
## $ Runtime : chr "109 min"
## $ Genre : chr "Drama, Fantasy, Horror"
## $ Director : chr "Robert Eggers"
## $ Writer : chr "Robert Eggers, Max Eggers"
## $ Actors : chr "Robert Pattinson, Willem Dafoe, Valeriia Karaman"
## $ Plot : chr "Two lighthouse keepers try to maintain their sanity while living on a remote and mysterious New England island in the 1890s."
## $ Language : chr "English"
## $ Country : chr "United States, Canada"
## $ Awards : chr "Nominated for 1 Oscar. 33 wins & 134 nominations total"
## $ Poster : chr "https://m.media-amazon.com/images/M/MV5BZmE0MGJhNmYtOWNjYi00Njc5LWE2YjEtMWMxZTVmODUwMmMxXkEyXkFqcGdeQXVyMTkxNjU"| __truncated__
## $ Ratings :List of 3
## ..$ :List of 2
## .. ..$ Source: chr "Internet Movie Database"
## .. ..$ Value : chr "7.4/10"
## ..$ :List of 2
## .. ..$ Source: chr "Rotten Tomatoes"
## .. ..$ Value : chr "90%"
## ..$ :List of 2
## .. ..$ Source: chr "Metacritic"
## .. ..$ Value : chr "83/100"
## $ Metascore : chr "83"
## $ imdbRating: chr "7.4"
## $ imdbVotes : chr "239,704"
## $ imdbID : chr "tt7984734"
## $ Type : chr "movie"
## $ DVD : chr "18 Oct 2019"
## $ BoxOffice : chr "$10,867,104"
## $ Production: chr "N/A"
## $ Website : chr "N/A"
## $ Response : chr "True"
In the API documentation:
Headers supply additional options for the return type.
Common headers are described by Wikipedia: https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
You supply headers to a request via
req_headers()
.
Consider the icanhazdadjoke API. One option is to include a header to specify plain text returns, rather than JSON returns.
request("https://icanhazdadjoke.com/") |>
req_headers(Accept = "text/plain") |>
req_perform() ->
dadout
dadout
resp_body_string(dadout)
This is different than a JSON output
request("https://icanhazdadjoke.com/") |>
req_perform() ->
dadout2
resp_body_string(dadout2)
Functions that work with the response are all of the form
resp_*()
.
The status code describes whether your request was successful.
List of codes: https://http.cat/
Use resp_status()
to get the code for our
request:
resp_status(mout)
## [1] 200
Headers provide infromation on the request. Use the
resp_headers()
function to see what headers we got in our
request.
resp_headers(mout)
The body contains the data you are probably most
interested in. Use the resp_body_*()
functions to access
the body:
resp_body_json(mout)
Background: The body typically comes in the form of either a JSON or XML data structure.
resp_body_json()
resp_body_xml()
resp_body_string()
resp_body_string(mout)
## [1] "{\"Title\":\"The Lighthouse\",\"Year\":\"2019\",\"Rated\":\"R\",\"Released\":\"01 Nov 2019\",\"Runtime\":\"109 min\",\"Genre\":\"Drama, Fantasy, Horror\",\"Director\":\"Robert Eggers\",\"Writer\":\"Robert Eggers, Max Eggers\",\"Actors\":\"Robert Pattinson, Willem Dafoe, Valeriia Karaman\",\"Plot\":\"Two lighthouse keepers try to maintain their sanity while living on a remote and mysterious New England island in the 1890s.\",\"Language\":\"English\",\"Country\":\"United States, Canada\",\"Awards\":\"Nominated for 1 Oscar. 33 wins & 134 nominations total\",\"Poster\":\"https://m.media-amazon.com/images/M/MV5BZmE0MGJhNmYtOWNjYi00Njc5LWE2YjEtMWMxZTVmODUwMmMxXkEyXkFqcGdeQXVyMTkxNjUyNQ@@._V1_SX300.jpg\",\"Ratings\":[{\"Source\":\"Internet Movie Database\",\"Value\":\"7.4/10\"},{\"Source\":\"Rotten Tomatoes\",\"Value\":\"90%\"},{\"Source\":\"Metacritic\",\"Value\":\"83/100\"}],\"Metascore\":\"83\",\"imdbRating\":\"7.4\",\"imdbVotes\":\"239,704\",\"imdbID\":\"tt7984734\",\"Type\":\"movie\",\"DVD\":\"18 Oct 2019\",\"BoxOffice\":\"$10,867,104\",\"Production\":\"N/A\",\"Website\":\"N/A\",\"Response\":\"True\"}"
resp_header("content-type")
will often tell you the
type of body output.
resp_header(mout, "content-type")
## [1] "application/json; charset=utf-8"
resp_header(eout, "content-type")
## [1] "application/json; charset=utf-8"
The format for the data you typically get from an API is a list of lists.
The repurrrsive
package contains a few examples.
library(repurrrsive)
The dataset gh_users
contains information from 6
users, obtained using GitHub’s API: https://developer.github.com/v3/users/#get-a-single-user
data("gh_users")
typeof(gh_users)
## [1] "list"
length(gh_users)
## [1] 6
names(gh_users[[1]])
## [1] "login" "id" "avatar_url"
## [4] "gravatar_id" "url" "html_url"
## [7] "followers_url" "following_url" "gists_url"
## [10] "starred_url" "subscriptions_url" "organizations_url"
## [13] "repos_url" "events_url" "received_events_url"
## [16] "type" "site_admin" "name"
## [19] "company" "blog" "location"
## [22] "email" "hireable" "bio"
## [25] "public_repos" "public_gists" "followers"
## [28] "following" "created_at" "updated_at"
There are a few functions from tidyr
package that
can be used to “unnest” a list into a data frame.
unnest_wider()
: Each list element becomes its own
columnhoist()
: Similar as unnest_wider()
, but
you can choose what elements to extract to their own columns.unnest_longer()
: Each list element becomes its own
row.Because this is the tidyverse, your list needs to be a column in a data frame:
df <- tibble(users = gh_users)
df
## # A tibble: 6 × 1
## users
## <list>
## 1 <named list [30]>
## 2 <named list [30]>
## 3 <named list [30]>
## 4 <named list [30]>
## 5 <named list [30]>
## 6 <named list [30]>
In which case, you can consider the unnest functions via:
unnest_wider()
preserves the rows, but changes the
columns.unnest_longer()
preserves the columns, but changes the
rowsUse unnest_wider()
to make each of those list
elements their own column.
unnest_wider(df, users)
## # A tibble: 6 × 30
## login id avatar_url gravatar_id url html_url followers_url following_url
## <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 gabo… 6.60e5 https://a… "" http… https:/… https://api.… https://api.…
## 2 jenn… 5.99e5 https://a… "" http… https:/… https://api.… https://api.…
## 3 jtle… 1.57e6 https://a… "" http… https:/… https://api.… https://api.…
## 4 juli… 1.25e7 https://a… "" http… https:/… https://api.… https://api.…
## 5 leep… 3.51e6 https://a… "" http… https:/… https://api.… https://api.…
## 6 masa… 8.36e6 https://a… "" http… https:/… https://api.… https://api.…
## # ℹ 22 more variables: gists_url <chr>, starred_url <chr>,
## # subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
## # events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
## # name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
## # hireable <lgl>, bio <chr>, public_repos <int>, public_gists <int>,
## # followers <int>, following <int>, created_at <chr>, updated_at <chr>
Graphic:
Use hoist()
to only extract some elements of a
list:
df2 <- hoist(df, users,
login = "login",
followers = "followers")
df2
## # A tibble: 6 × 3
## login followers users
## <chr> <int> <list>
## 1 gaborcsardi 303 <named list [28]>
## 2 jennybc 780 <named list [28]>
## 3 jtleek 3958 <named list [28]>
## 4 juliasilge 115 <named list [28]>
## 5 leeper 213 <named list [28]>
## 6 masalmon 34 <named list [28]>
Notice that the new data frame still has a list column, but each
each list now only has 28 elements instead of 30. That’s because
hoist()
has removed the login
and
followers
elements from each list in the list column.
df2$users[[1]]["login"]
## $<NA>
## NULL
df2$users[[1]]["followers"]
## $<NA>
## NULL
Graphic:
Let’s look at the list gh_repos
, which was also
obtained using the GitHub API: https://developer.github.com/v3/repos/#list-user-repositories
data("gh_repos")
typeof(gh_repos)
## [1] "list"
length(gh_repos)
## [1] 6
df <- tibble(repo = gh_repos)
df
## # A tibble: 6 × 1
## repo
## <list>
## 1 <list [30]>
## 2 <list [30]>
## 3 <list [30]>
## 4 <list [26]>
## 5 <list [30]>
## 6 <list [30]>
This list is very complicated!
Ideally, each repo would be its own row (since the repositories
are the observational units). We can do this with
unnest_longer()
.
df2 <- unnest_longer(df, repo)
df2
## # A tibble: 176 × 1
## repo
## <list>
## 1 <named list [68]>
## 2 <named list [68]>
## 3 <named list [68]>
## 4 <named list [68]>
## 5 <named list [68]>
## 6 <named list [68]>
## 7 <named list [68]>
## 8 <named list [68]>
## 9 <named list [68]>
## 10 <named list [68]>
## # ℹ 166 more rows
Graphic:
We can then use unnest_wider()
or
hoist()
on this new data frame to get a tidy dataset.
df3 <- unnest_wider(df2, repo)
df3
## # A tibble: 176 × 68
## id name full_name owner private html_url description fork url
## <int> <chr> <chr> <list> <lgl> <chr> <chr> <lgl> <chr>
## 1 6.12e7 after gaborcsa… <named list> FALSE https:/… Run Code i… FALSE http…
## 2 4.05e7 argu… gaborcsa… <named list> FALSE https:/… Declarativ… FALSE http…
## 3 3.64e7 ask gaborcsa… <named list> FALSE https:/… Friendly C… FALSE http…
## 4 3.49e7 base… gaborcsa… <named list> FALSE https:/… Do we get … FALSE http…
## 5 6.16e7 cite… gaborcsa… <named list> FALSE https:/… Test R pac… TRUE http…
## 6 3.39e7 clis… gaborcsa… <named list> FALSE https:/… Unicode sy… FALSE http…
## 7 3.72e7 cmak… gaborcsa… <named list> FALSE https:/… port of cm… TRUE http…
## 8 6.80e7 cmark gaborcsa… <named list> FALSE https:/… CommonMark… TRUE http…
## 9 6.32e7 cond… gaborcsa… <named list> FALSE https:/… <NA> TRUE http…
## 10 2.43e7 cray… gaborcsa… <named list> FALSE https:/… R package … FALSE http…
## # ℹ 166 more rows
## # ℹ 59 more variables: forks_url <chr>, keys_url <chr>,
## # collaborators_url <chr>, teams_url <chr>, hooks_url <chr>,
## # issue_events_url <chr>, events_url <chr>, assignees_url <chr>,
## # branches_url <chr>, tags_url <chr>, blobs_url <chr>, git_tags_url <chr>,
## # git_refs_url <chr>, trees_url <chr>, statuses_url <chr>,
## # languages_url <chr>, stargazers_url <chr>, contributors_url <chr>, …
Graphic:
Exercise: Clean up the sw_people
list from the repurrrsive
package:
library(repurrrsive)
data("sw_people")
Exercise: Clean up the got_chars
list from the repurrrsive
package
library(repurrrsive)
data("got_chars")
Consider the NASA API
Generate an API key and save it as NASA_API_KEY
in
your R environment file.
Read about the APOD API and obtain all images from January of 2022.
Clean the output into a data frame, like this:
## # A tibble: 31 × 8
## copyright date explanation hdurl media_type service_version title url
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "Soumyadeep M… 2022… very Full … http… image v1 The … http…
## 2 "\nDani Caxet… 2022… Sometimes … http… image v1 Quad… http…
## 3 "\nJan Hatten… 2022… You couldn… http… image v1 Come… http…
## 4 <NA> 2022… What's hap… http… image v1 Moon… http…
## 5 "\nLuca Vanze… 2022… Does the S… http… image v1 A Ye… http…
## 6 "Tamas Ladany… 2022… That's not… http… image v1 The … http…
## 7 "Point Blue C… 2022… A male Ade… http… image v1 Ecst… http…
## 8 "Cheng Luo" 2022… Named for … http… image v1 Quad… http…
## 9 <NA> 2022… What will … http… image v1 Hubb… http…
## 10 <NA> 2022… Why does C… <NA> video v1 Come… http…
## # ℹ 21 more rows