library(DBI)
library(duckdb)
SQL
Learning Objectives
- Learn some SQL
- Interface SQL with R through the
{DBI}
and{duckdb}
packages.- Introduction to DBI: https://solutions.posit.co/connections/db/r-packages/dbi/
- DuckDB SQL Introduction: https://duckdb.org/docs/sql/introduction
- DuckDB R API: https://duckdb.org/docs/api/r.html
- Write SQL code using the tidyverse and the
{dbplyr}
package.{dbplyr}
and SQL: https://dbplyr.tidyverse.org/articles/sql.html- R for Data Science Database Chapter: https://r4ds.hadley.nz/databases.html
SQL Introduction and Motivation
SQL is pronounced either “ess-kew-ell” or “sequel” (either is fine).
SQL is a programming language which basically allows you to do a lot that dplyr does (filters, selects, joins, etc…). It is the industry standard for interacting with a relational database.
You all are experts in data munging once a dataset has been downloaded.
But, many datasets live in a relational database which is too huge to download onto your computer.
A common task for a data scientist is to download a subset of this huge database so that they can explore it on their computer.
SQL is how you download this subset. You choose which data frames (“tables” in SQL) to download, what subsets of these tables to download (by filtering rows), and whether to join tables together in this process.
We will use R Studio in this lesson. But if you get really into SQL, a popular IDE is DBeaver, with a tutorial from DuckDB here.
DuckDB
There are a lots of database management systems (DBMS) that use SQL to connect to their databases.
In this lecture, we will only consider DuckDB since it has no external dependencies and is pretty easy to install. Just do this in R:
R
install.packages("duckdb")
The most popular DBMS is probably SQLite. Other popular ones are MySQL and PostgreSQL.
DuckDB is a specific SQL backend. The R package that is used to connect to any SQL backend is
{DBI}
, and so we will load that one too.R
We will compare to the tidyverse:
R
library(nycflights13) library(tidyverse)
Let’s download a duck database that I created. Put this in a location you can get to: https://data-science-master.github.io/lectures/data/flights.duckdb
Use
duckdb()
andDBI::dbConnect()
to create a connection to “flights.duckdb”.R
<- dbConnect(duckdb(dbdir = "../data/flights.duckdb", read_only = TRUE)) con
A basic SQL code chunk looks like this (put SQL code between the chunks):
```{sql, connection=con} ```
Basic SQL
General
Case does not matter (i.e.
select
is the same asSELECT
is the same asSeLeCt
), but it is standard to have all statements be in UPPERCASE (i.e.SELECT
).The statements below must be in the following order:
SELECT
,FROM
,WHERE
,GROUP BY
,ORDER BY
.New lines and white space don’t matter. But it is common to put those five commands above on new lines.
Character values must be in single quotes.
You can use invalid variable names by putting them in double quotes (same as using backticks in R).
- Some folks always use double quotes because it is not always clear what is an invalid variable name in the database management system. This is what I do below.
Comments in SQL have two hyphens
--
.Make sure to put a semicolon
;
at the end of a SQL statement. This will allow you to have multiple SQL queries in one chunk.
Showing Tables
The
SHOW TABLES
command can be used to get a list of all of the tablesSQL
TABLES; SHOW
5 records name airlines airports flights planes weather The
DESCRIBE
command can be used to show tables and the variables.SQL
DESCRIBE;
5 records database schema name column_names column_types temporary flights main airlines carrier, name VARCHAR, VARCHAR FALSE flights main airports faa , name , lat , lon , alt , tz , dst , tzone VARCHAR, VARCHAR, DOUBLE , DOUBLE , DOUBLE , DOUBLE , VARCHAR, VARCHAR FALSE flights main flights year , month , day , dep_time , sched_dep_time, dep_delay , arr_time , sched_arr_time, arr_delay , carrier , flight , tailnum , origin , dest , air_time , distance , hour , minute , time_hour INTEGER , INTEGER , INTEGER , INTEGER , INTEGER , DOUBLE , INTEGER , INTEGER , DOUBLE , VARCHAR , INTEGER , VARCHAR , VARCHAR , VARCHAR , DOUBLE , DOUBLE , DOUBLE , DOUBLE , TIMESTAMP FALSE flights main planes tailnum , year , type , manufacturer, model , engines , seats , speed , engine VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, INTEGER, INTEGER, INTEGER, VARCHAR FALSE flights main weather origin , year , month , day , hour , temp , dewp , humid , wind_dir , wind_speed, wind_gust , precip , pressure , visib , time_hour VARCHAR , INTEGER , INTEGER , INTEGER , INTEGER , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , TIMESTAMP FALSE
Select Specific Columns
To get the columns from a data frame, use the
SELECT
command.The syntax is like this
SQL
SELECT <column1>, <column2>, <column3> FROM <mytable>;
Let’s use this to get the
tailnum
,year
, andmodel
variables from theplanes
table.SQL
SELECT "tailnum", "year", "model" FROM planes;
Displaying records 1 - 10 tailnum year model N10156 2004 EMB-145XR N102UW 1998 A320-214 N103US 1999 A320-214 N104UW 1999 A320-214 N10575 2002 EMB-145LR N105UW 1999 A320-214 N107US 1999 A320-214 N108UW 1999 A320-214 N109UW 1999 A320-214 N110UW 1999 A320-214 R equivalent
R
|> planes select(tailnum, year, model)
You select every column by using
*
:SQL
SELECT * FROM planes;
Displaying records 1 - 10 tailnum year type manufacturer model engines seats speed engine N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan R equivalent
R
planes
There is no equivalent for excluding columns (like
dplyr::select(-year)
). You just select the ones you want.
Filter rows
You use the
WHERE
command in SQL to filter by rows.SQL
SELECT "flight", "distance", "origin", "dest" FROM flights WHERE "distance" < 50;
1 records flight distance origin dest 1632 17 EWR LGA R equivalent:
R
|> flights select(flight, distance, origin, dest) |> filter(distance < 50)
To test for equality, you just use one equals sign.
SQL
SELECT "flight", "month" FROM flights WHERE "month" = 12;
Displaying records 1 - 10 flight month 745 12 839 12 1895 12 1487 12 2243 12 939 12 3819 12 1441 12 2167 12 605 12 R equivalent
R
|> flights select(flight, month) |> filter(month == 12)
For characters you must use single quotes, not double.
SQL
SELECT "flight", "origin" FROM flights WHERE "origin" = 'JFK';
Displaying records 1 - 10 flight origin 1141 JFK 725 JFK 79 JFK 49 JFK 71 JFK 194 JFK 1806 JFK 1743 JFK 303 JFK 135 JFK R equivalent
R
|> flights select(flight, origin) |> filter(origin == "JFK")
You can select multiple criteria using the
AND
commandSQL
SELECT "flight", "origin", "dest" FROM flights WHERE "origin" = 'JFK' AND "dest" = 'CMH';
Displaying records 1 - 10 flight origin dest 4146 JFK CMH 3783 JFK CMH 4146 JFK CMH 3783 JFK CMH 4146 JFK CMH 3783 JFK CMH 4146 JFK CMH 3783 JFK CMH 4146 JFK CMH 3650 JFK CMH R equivalent:
R
|> flights select(flight, origin, dest) |> filter(origin == "JFK", dest == "CMH")
You can use the
OR
logical operator too. Just put parentheses around your desired order of operations.SQL
SELECT "flight", "origin", "dest" FROM flights WHERE ("origin" = 'JFK' OR "origin" = 'LGA') AND dest = 'CMH';
Displaying records 1 - 10 flight origin dest 4146 JFK CMH 3783 JFK CMH 4146 JFK CMH 3783 JFK CMH 4490 LGA CMH 4485 LGA CMH 4426 LGA CMH 4429 LGA CMH 4626 LGA CMH 4555 LGA CMH R equivalent
R
|> flights select(flight, origin, dest) |> filter(origin == "JFK" | origin == "LGA", dest == "CMH")
Missing data is
NULL
in SQL (instead ofNA
). We can remove them by the special command:SQL
SELECT "flight", "dep_delay" FROM flights WHERE "dep_delay" IS NOT NULL;
Displaying records 1 - 10 flight dep_delay 1545 2 1714 4 1141 2 725 -1 461 -6 1696 -4 507 -5 5708 -3 79 -3 301 -2 R equivalent
R
|> flights select(flight, dep_delay) |> filter(!is.na(dep_delay))
Just use
IS
if you want only the missing data observationsSQL
SELECT "flight", "dep_delay" FROM flights WHERE "dep_delay" IS NULL;
Displaying records 1 - 10 flight dep_delay 4308 NA 791 NA 1925 NA 125 NA 4352 NA 4406 NA 4434 NA 4935 NA 3849 NA 133 NA When you are building a query, you often want to subset the rows while you are finishing it (you don’t want to return the whole table each time you are trouble shooting a query). Use
LIMIT
to show only the top subset.SQL
SELECT "flight", "origin", "dest" FROM flights LIMIT 5;
5 records flight origin dest 1545 EWR IAH 1714 LGA IAH 1141 JFK MIA 725 JFK BQN 461 LGA ATL You can also randomly sample rows via
USING SAMPLE
:SQL
SELECT "flight", "origin", "dest" FROM flights USING SAMPLE 5 ROWS;
5 records flight origin dest 1243 EWR BOS 1061 JFK AUS 507 EWR FLL 1305 JFK IAD 4622 LGA BNA
Arranging Rows
Use
ORDER BY
to rearrange the rows (let’s remove missing values so we can see the ordering)SQL
SELECT "flight", "dep_delay" FROM flights WHERE "dep_delay" IS NOT NULL ORDER BY "dep_delay";
Displaying records 1 - 10 flight dep_delay 97 -43 1715 -33 5713 -32 1435 -30 837 -27 3478 -26 4573 -25 4361 -25 3318 -24 375 -24 R equivalent
R
|> flights select(flight, dep_delay) |> filter(!is.na(dep_delay)) |> arrange(dep_delay)
Use
DESC
after the variable to arrange in descending orderSQL
SELECT "flight", "dep_delay" FROM flights WHERE "dep_delay" IS NOT NULL ORDER BY "dep_delay" DESC;
Displaying records 1 - 10 flight dep_delay 51 1301 3535 1137 3695 1126 177 1014 3075 1005 2391 960 2119 911 2007 899 2047 898 172 896 R equivalent
R
|> flights select(flight, dep_delay) |> filter(!is.na(dep_delay)) |> arrange(desc(dep_delay))
You break ties by adding more variables in the
ORDER BY
statementSQL
SELECT "flight", "origin", "dep_delay" FROM flights WHERE "dep_delay" IS NOT NULL ORDER BY "origin" DESC, "dep_delay";
Displaying records 1 - 10 flight origin dep_delay 1715 LGA -33 5713 LGA -32 1435 LGA -30 837 LGA -27 3478 LGA -26 4573 LGA -25 375 LGA -24 4065 LGA -24 2223 LGA -24 5956 LGA -23 R equivalent
R
|> flights select(flight, origin, dep_delay) |> filter(!is.na(dep_delay)) |> arrange(desc(origin), dep_delay)
Mutate
In SQL, you mutate variables while you SELECT. You use
AS
to specify what the new variable is called (choosing a variable name is called “aliasing” in SQL).SQL
SELECT <expression> AS <myvariable> FROM <mytable>;
Let’s calculate average speed from the
flights
table. We’ll also keep the flight number, distance, and air time variables.SQL
SELECT "flight", "distance" / "air_time" AS "speed", "distance", "air_time" FROM flights;
Displaying records 1 - 10 flight speed distance air_time 1545 6.167 1400 227 1714 6.238 1416 227 1141 6.806 1089 160 725 8.612 1576 183 461 6.569 762 116 1696 4.793 719 150 507 6.740 1065 158 5708 4.321 229 53 79 6.743 944 140 301 5.312 733 138 R equivalent:
R
|> flights select(flight, distance, air_time) |> mutate(speed = distance / air_time)
Various transformation functions also exist:
LN()
: Natural log transformation.EXP()
: Exponentiation.SQRT()
: Square root.POW()
: Power transformation.POW(2.0, x)
would be \(2^x\)POW(x, 2.0)
would be \(x^2\)
DuckDB is good about implicit coercion. E.g., the following integer division results in a double:
SELECT DISTINCT "month", "day", "day" / "month" AS "ratio" FROM flights WHERE "month" >= 5 ORDER BY "month", "day";
Displaying records 1 - 10 month day ratio 5 1 0.2 5 2 0.4 5 3 0.6 5 4 0.8 5 5 1.0 5 6 1.2 5 7 1.4 5 8 1.6 5 9 1.8 5 10 2.0 But many SQL backends are not so good about implicit coercion.
Use
CAST()
to convert to a double before operations that should produce doubles.SELECT DISTINCT "month", "day", CAST("day" AS DOUBLE) / CAST("month" AS DOUBLE) AS "ratio" FROM flights WHERE "month" >= 5 ORDER BY "month", "day";
Displaying records 1 - 10 month day ratio 5 1 0.2 5 2 0.4 5 3 0.6 5 4 0.8 5 5 1.0 5 6 1.2 5 7 1.4 5 8 1.6 5 9 1.8 5 10 2.0 Mutating over partition can be done with
OVER (PARTITION BY <variable>)
. E.g., here is how you find the flight numbers for the longest flights (in terms of air time) from each airportSELECT "flight", "origin", "air_time" FROM (SELECT "flight", "origin", "air_time", MAX("air_time") OVER (PARTITION BY "origin") AS "amax" FROM flights )WHERE "air_time" = "amax"
3 records flight origin air_time 51 JFK 691 15 EWR 695 745 LGA 331 In the above, I chained two SQL queries. Your
FROM
statement can be the output of another SQL query surrounded by parentheses.
Group Summaries
SQL has a few summary functions (SQL calls these “Aggregates”)
COUNT()
: Count the number of rows.AVG()
: Calculate average.MEDIAN()
: Median (not standard across all DBMS’s).SUM()
: Summation.MIN()
: Minimum.MAX()
: Maximum.STDDEV()
: Standard deviation.VARIANCE()
: Variance
By default, all missing data are ignored (like setting
na.rm = TRUE
).These are calculated in a
SELECT
commandLet’s calculate the average departue delay
SQL
SELECT AVG("dep_delay") AS "dep_delay" FROM flights;
1 records dep_delay 12.64 R equivalent:
R
|> flights summarize(dep_delay = mean(dep_delay, na.rm = TRUE))
Use the
GROUP BY
command to calculate group summaries.SQL
SELECT "origin", AVG("dep_delay") FROM flights GROUP BY "origin";
3 records origin avg(dep_delay) LGA 10.35 JFK 12.11 EWR 15.11 R equivalent
R
|> flights select(origin, dep_delay) |> group_by(origin) |> summarize(dep_delay = mean(dep_delay, na.rm = TRUE))
You can get distinct rows by using the prefix
DISTINCT
in aSELECT
statement. E.g. the following will pick up all unique originsSELECT DISTINCT "origin" FROM flights;
3 records origin JFK EWR LGA
Creating Grouped Summaries without Grouping the Results
This section is based on Richard Ressler’s notes.
There may be times when you want to add a grouped summary to the data without collapsing the data into the groups.
As an example, you want to calculate the total departure delay time for each destination so you can use it to then calculate the percentage of that departure delay time for each airline flying to that destination.
You could do that with a summarized dataframe and then a join.
in R, this can be done by using
mutate()
instead ofsummarize()
to add a new column to the data frame while preserving all of the rows.In SQL, you have to indicate you want to use the aggregate function as a Window Function to combine grouped aggregated and non-aggregated data into a single result-set table.
When operating as a window function, the table is partitioned into sets of records based on a Field.
Then, the aggregate function is applied to the set of records in each partition and a new field is added to the record with the aggregated value for that partition.
To use an aggregate function as a window function, use the
OVER
clause with a(PARTITION BY myfield)
modifier.Here is how you find the flight numbers for the longest flights (in terms of air time) from each airport.
SQL
SELECT "flight", "origin", "air_time", MAX("air_time") OVER (PARTITION BY "origin") AS "amax" FROM flightsLIMIT 10 OFFSET 120827;
Displaying records 1 - 10 flight origin air_time amax 1100 EWR 324 695 449 EWR 82 695 4382 EWR 87 695 4691 EWR 114 695 1037 EWR 330 695 1266 EWR 301 695 1199 EWR 38 695 3709 EWR 114 695 4407 EWR 164 695 580 EWR 192 695 Note the
amax
field is new and the values are all the same for the records for each origin.See SQL Window Functions or SQL PARTITION BY Clause overview for other examples.
This approach can be useful as a sub-query (or inner nested query) inside the outer query
FROM
clause.Here is how you find the flight numbers and destination for the longest flights (in terms of air time) from each airport after using the sub-query to find the longest air time from each origin.
SQL
SELECT "flight", "origin", "dest", "air_time" FROM (SELECT "flight", "origin", "dest", "air_time", MAX("air_time") OVER (PARTITION BY "origin") AS "amax" FROM flights )WHERE "air_time" = "amax"
3 records flight origin dest air_time 51 JFK HNL 691 15 EWR HNL 695 745 LGA DEN 331 Note that all fields used in the outer query must be returned by the inner query
Recoding
Use the following
CASE
-WHEN
syntax to recode valuesSQL
SELECT "flight", "origin", CASE WHEN ("origin" = 'JFK') THEN 'John F. Kennedy' WHEN ("origin" = 'LGA') THEN 'LaGaurdia' WHEN ("origin" = 'EWR') THEN 'Newark Liberty' END AS "olong" FROM flights;
Displaying records 1 - 10 flight origin olong 1545 EWR Newark Liberty 1714 LGA LaGaurdia 1141 JFK John F. Kennedy 725 JFK John F. Kennedy 461 LGA LaGaurdia 1696 EWR Newark Liberty 507 EWR Newark Liberty 5708 LGA LaGaurdia 79 JFK John F. Kennedy 301 LGA LaGaurdia R equivalent:
R
|> flights select(flight, origin) |> mutate(olong = case_when( == "JFK" ~ "John F. Kennedy", origin == "LGA" ~ "LaGuardia", origin == "EWR" ~ "Newark Liberty") origin ) ## or |> flights select(flight, origin) |> mutate(olong = recode( origin,"JFK" = "John F. Kennedy", "LGA" = "LaGuardia", "EWR" = "Newark Liberty") )
You can also use
CASE
-WHEN
to recode based on other logical operationsSQL
SELECT "flight", "air_time", CASE WHEN ("air_time" > 2500) THEN 'Long' WHEN ("air_time" <= 2500) THEN 'Short' END AS "qual_dist" FROM flights;
Displaying records 1 - 10 flight air_time qual_dist 1545 227 Short 1714 227 Short 1141 160 Short 725 183 Short 461 116 Short 1696 150 Short 507 158 Short 5708 53 Short 79 140 Short 301 138 Short R equivalent:
R
|> flights select(flight, air_time) |> mutate(qual_dist = case_when( > 2500 ~ "Long", air_time <= 2500 ~ "Short") air_time ) ## or |> flights select(flight, air_time) |> mutate(qual_dist = if_else(air_time > 2500, "Long", "Short"))
Joining
For joining, in the
SELECT
call, you write out all of the columns in both tables that you are joining.If there are shared column names, you need to distinguish between the two via
table1."var"
ortable2."var"
etc…Use
LEFT JOIN
to declare a left join, andON
to declare the keys.SQL
-- flight is from the flights table -- type is from the planes table -- both tables have a tailnum column, so we need to tell them apart -- if you list both tailnums in SELECT, you'll get two tailnum columns SELECT "flight", flights."tailnum", "type" FROM flights JOIN planes ON flights."tailnum" = planes."tailnum";
Displaying records 1 - 10 flight tailnum type 461 N693DL Fixed wing multi engine 4424 N19966 Fixed wing multi engine 6177 N34111 Fixed wing multi engine 731 N319NB Fixed wing multi engine 684 N809UA Fixed wing multi engine 1279 N328NB Fixed wing multi engine 1691 N34137 Fixed wing multi engine 1447 N117UW Fixed wing multi engine 3574 N790SW Fixed wing multi engine 3351 N711MQ Fixed wing multi engine R equivalent:
R
|> planes select(tailnum, type) -> planes2|> flights select(flight, tailnum) |> left_join(planes2, by = "tailnum")
# A tibble: 336,776 × 3 flight tailnum type <int> <chr> <chr> 1 1545 N14228 Fixed wing multi engine 2 1714 N24211 Fixed wing multi engine 3 1141 N619AA Fixed wing multi engine 4 725 N804JB Fixed wing multi engine 5 461 N668DN Fixed wing multi engine 6 1696 N39463 Fixed wing multi engine 7 507 N516JB Fixed wing multi engine 8 5708 N829AS Fixed wing multi engine 9 79 N593JB Fixed wing multi engine 10 301 N3ALAA <NA> # ℹ 336,766 more rows
The other joins are:
RIGHT JOIN
INNER JOIN
FULL OUTER JOIN
Creating some Tables
You can create tables using SQL.
Let’s create a new, temporary, connection that we can put some example tables in.
R
<- dbConnect(duckdb()) tmpcon
You can use
CREATE TABLE
to declare new tables andINSERT INTO
to insert new rows into that table.- Note that I am putting a semicolon “
;
” after each SQL call in the same code chunk.
SQL
CREATE TABLE table4a ( "country" VARCHAR, "1999" INTEGER, "2000" INTEGER ); INSERT INTO table4a ("country", "1999", "2000") VALUES 'Afghanistan', 745, 2666), ('Brazil' , 37737, 80488), ('China' , 212258, 213766); (
- Note that I am putting a semicolon “
But
DBI::dbWriteTable()
will add a table to a connection from R.R
data("table1", package = "tidyr") dbWriteTable(conn = tmpcon, name = "table1", value = table1) data("table2", package = "tidyr") dbWriteTable(conn = tmpcon, name = "table2", value = table2) data("table3", package = "tidyr") dbWriteTable(conn = tmpcon, name = "table3", value = table3) data("table4b", package = "tidyr") dbWriteTable(conn = tmpcon, name = "table4b", value = table4b) data("table5", package = "tidyr") dbWriteTable(conn = tmpcon, name = "table5", value = table5)
Here they are:
SQL
DESCRIBE;
6 records database schema name column_names column_types temporary memory main table1 country , year , cases , population VARCHAR, DOUBLE , DOUBLE , DOUBLE FALSE memory main table2 country, year , type , count VARCHAR, DOUBLE , VARCHAR, DOUBLE FALSE memory main table3 country, year , rate VARCHAR, DOUBLE , VARCHAR FALSE memory main table4a country, 1999 , 2000 VARCHAR, INTEGER, INTEGER FALSE memory main table4b country, 1999 , 2000 VARCHAR, DOUBLE , DOUBLE FALSE memory main table5 country, century, year , rate VARCHAR, VARCHAR, VARCHAR, VARCHAR FALSE
Tidyr stuff
Recall the four main verbs of tidying: gathering, spreading, separating, and uniting.
These are very rarely used in SQL.
SQL is used mostly to query a subset of data. You would then load in that data using more advanced methods (R, Python, Excel, etc).
It’s still possible to spread and gather (but not separate and unite), but a huge pain and folks don’t typically do it.
Writing Tables to a CSV File
You can use
COPY
TO
to write the outputs of a SQL query to a CSV file.The syntax for this is
SQL
COPY (query goes here Your SQL )TO 'myfile.csv' (HEADER, DELIMITER ',');
Let’s write some group summaries to a CSV file
SQL
COPY (SELECT "origin", AVG("dep_delay") FROM flights GROUP BY "origin" )TO 'summaries.csv' (HEADER, DELIMITER ',');
This is what the resulting file looks like:
origin,avg(dep_delay) EWR,15.10795435218885 LGA,10.3468756464944 JFK,12.112159099217665
Passing SQL output to R in R Markdown
The
output.var
option in an R Markdown SQL chunk allows the output of SQL query to be called a new variable.E.g., this chunk option will make the SQL output to a new data frame called
df
(don’t forget quotes around the new variable name).```{sql, connection=con, output.var="df"} ```
Here is me setting the output variable to
df
:SQL
SELECT * FROM flights;
Here is the
df
variable in R:R
glimpse(df)
Rows: 336,776 Columns: 19 $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2… $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, … $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, … $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1… $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,… $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,… $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1… $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "… $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4… $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394… $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",… $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",… $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1… $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, … $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6… $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0… $ time_hour <dttm> 2013-01-01 10:00:00, 2013-01-01 10:00:00, 2013-01-01 1…
Calling SQL from R
Use
DBI::dbGetQuery()
to run SQL code in R and obtain the result as a data frame.R
<- DBI::dbGetQuery(conn = con, statement = "SELECT * FROM planes") planes2 glimpse(planes2)
Rows: 3,322 Columns: 9 $ tailnum <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW… $ year <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199… $ type <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi… $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU… $ model <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145… $ engines <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, … $ seats <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5… $ speed <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… $ engine <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…
Often, folks have SQL written in a separate file (that ends in “.sql”).
E.g., here is what is in “query.sql”:
-- Just gets some planes SELECT * FROM planes;
If you want to load the results of a SQL query in R, saved in “query.sql”, do
R
<- DBI::dbGetQuery(conn = con, statement = read_file("query.sql")) mydf glimpse(mydf)
Rows: 3,322 Columns: 9 $ tailnum <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW… $ year <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199… $ type <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi… $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU… $ model <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145… $ engines <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, … $ seats <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5… $ speed <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… $ engine <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…
Use R to generate SQL with {dbplyr}
{dbplyr}
allows you to use{dplyr}
code on a SQL backend.It will even generate SQL code for you, so you can use it as a way to learn SQL.
Let’s load it in:
R
library(dbplyr)
You retrieve a table from a SQL database using
tbl()
R
<- tbl(src = con, "flights") flights2
Now you can use your tidyverse code on the flights table
R
|> flights2 select(flight, origin, dest, dep_delay) |> filter(origin == "JFK", dest == "CMH") |> summarize(dep_delay = mean(dep_delay, na.rm = TRUE))
# Source: SQL [?? x 1] # Database: DuckDB v1.2.2 [root@Darwin 24.5.0:R 4.5.0//Users/dgerard/Library/CloudStorage/Dropbox/teaching/stat_413_613/lectures/data/flights.duckdb] dep_delay <dbl> 1 22.0
You execute the query by using
collect()
at the end.R
|> flights2 select(flight, origin, dest, dep_delay) |> filter(origin == "JFK", dest == "CMH") |> summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |> collect()
# A tibble: 1 × 1 dep_delay <dbl> 1 22.0
To see the SQL code that was generated, use
show_query()
at the end (this is a good way to learn SQL).R
|> flights2 select(flight, origin, dest, dep_delay) |> filter(origin == "JFK", dest == "CMH") |> summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |> show_query()
<SQL> SELECT AVG(dep_delay) AS dep_delay FROM ( SELECT flight, origin, dest, dep_delay FROM flights WHERE (origin = 'JFK') AND (dest = 'CMH') ) q01
Closing the Connection
After you are done with SQL, you should close down your connection:
R
::dbDisconnect(con, shutdown = TRUE) DBI::dbDisconnect(tmpcon, shutdown = TRUE) DBI
Exercises
The file starwars.duckdb contains the Star Wars database from the {starwarsdb}
R package. Only use SQL to answer the following questions.
Download these data and open a connection to the database.
Print out a summary of the tables in this database.
Select the entire
people
table.Select just the
name
,height
,mass
, andspecies
variables from thepeople
table.Add to the above query by selecting only the humans and droids.
Remove the individuals with missing
mass
data from the above query.Modify the above query to calculate the average height and mass for humans and droids.
Make sure that Droids are in the first row and Humans are in the second row in the above summary output.
Here is the summary of the keys for the database from the
{starwarsdb}
GitHub page.Select all films with characters whose homeworld is Kamino.
Filter the
people
table to only contain humans from Tatooine and export the result to a CSV file called “folks.csv”.Close the SQL connection.