library(tidyverse)
library(data.table)
Manipulating Large(ish) Datasets with data.table
Learning Objectives
- Reading large datasets into R.
data.table
syntax for manipulating data frames.- A data.table and dplyr tour
- Introduction to data.table.
- Efficient reshaping using data.tables
- Interactive Cheat Sheet
Data Frames and Motivation
Big data: Can’t read data into memory. Need to do distributed computing (like Spark).
Biggish data: Can read data into memory, but it’s very slow to manipulate it.
The data.table package helps you with biggish data.
data.frame
s are a base R class. We’ve learned about them before.There are many competitors to
data.frame
s:tibble
,DataFrame
(from Bioconductor), anddata.table
.Julia (another statistical programming language) has its own
DataFrames
class. Python’s data frame packages arepandas
andpolars
.data.table
is among the fastest in most benchmarks: https://h2oai.github.io/db-benchmark/.- Python’s
polars
, and Julia’sDataFrames.jl
come close and are faster in some benchmarks. dplyr
doesn’t come close.
- Python’s
So why did we learn the tidyverse?
- The Posit folks are better at marketing, so
dplyr
is more widely used. - For datasets with fewer than, like, one million observations, dplyr works fast enough.
- The Posit folks are better at marketing, so
Syntax
Reading in and Printing Data
Load the data.table package and (to compare) the tidyverse into R:
We’ll demonstrate most methods with the data from “NYC-flights14” dataset from the data.table package vignettes. I’ve posted it on my own webpage: https://data-science-master.github.io/lectures/data/flights14.csv.
Read in data with
fread()
(for “file read”). It accepts all delimiters, which you (optionally) specify with thesep
argument. The default is to guess the delimiter.<- fread("../data/flights14.csv") flights
fread()
will return adata.table
object.class(flights)
[1] "data.table" "data.frame"
Use
fwrite()
to write adata.table
object to a file.Compare to
read_csv()
in the tidyverse.<- read_csv("../data/flights14.csv") flights_tib
read_csv()
will return a tibble.class(flights_tib)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
tibble
s anddata.table
s both print better thandata.frame
s.flights flights_tib
You usually use the base function
str()
(for “structure”) to look at thedata.table
entries.## data.table way str(flights) ## Similar tidyverse way glimpse(flights_tib)
You can use
as.data.table()
to convert atibble
or adata.frame
into adata.table
. But there is rarely a time when you’d do this, since you use data.table for large datasets that you read in withfread()
.<- as.data.table(flights_tib) temp_dt class(temp_dt)
[1] "data.table" "data.frame"
Filtering/Arranging Rows (Observations)
Just like in the tidyverse, we use logicals to filter based on rows. The syntax for this is to place the logicals inside a bracket. Let’s find all flights that left JFK and arrived at LAX.
## data.table way == "JFK" & dest == "LAX"] flights[origin ## equivalent tidyverse way %>% flights_tib filter(origin == "JFK", dest == "LAX")
To get a specific row, insert a number into the brackets.
## data.table way c(1, 3, 207)] flights[ ## equivalent tidyverse way %>% flights_tib slice(1, 3, 207)
Reorder rows by using the
order()
function inside the brackets. Let’s reorder the rows alphabetically by origin, and break ties in reverse alphabetical order by destination.## data.table way order(origin, -dest)] flights[ ## equivalent tidyverse way %>% flights_tib arrange(origin, desc(dest))
Exercise: Use both data.table and the tidyverse to select all flights from EWR and LGA, and arrange the flights in decreasing order of departure delay.
Selecting Columns (Variables)
To select a variable (a column), you also use bracket notation, but you place a comma before you select the columns. This idea is that you are selecting all rows (empty space before comma).
There are lots of ways to select columns that keeps the new object a
data.table
.List method: Use the
.()
function (which is an alias forlist()
).## data.table way flights[, .(origin, dest)] ## Or list(origin, dest)] flights[, ## equivalent tidyverse way %>% flights_tib select(origin, dest)
Character Vector Method: Use
c()
with their character namesc("origin", "dest")] flights[,
Range Method: Use
:
to select variables within a range of columns.:dest] flights[, origin
Prespecify Method: Define variables to keep outside of the
data.table
, then usewith = FALSE
. This option makes data.table not think thatkeep_vec
is a varible in thedata.table
.<- c("origin", "dest") keep_vec = FALSE] flights[, keep_vec, with
To remove a column using the range or character methods, place a
!
before the columns to remove!c("year", "month")] flights[, !(year:month)] flights[,
To remove a column using the list method, assign that variable to be
NULL
using modify by reference (see below). If you run the below code, you’ll need to reload theflights
data to get backyear
andmonth
.c("year", "month") := .(NULL, NULL)] flights[,
Exercise: Use data.table to select the
year
,month
,day
, andhour
columns.Unlike the tidyverse, you filter rows and select columns in one call rather than using two separate functions.
## data.table way == "JFK" & dest == "LAX", .(year, month, day, hour)] flights[origin ## equivalent tidyverse way %>% flights_tib filter(origin == "JFK", dest == "LAX") %>% select(year, month, day, hour)
Exercise: Use data.table to print out just the departure delays from JFK.
Creating New Variables (Mutate)
The fastest way to create and remove variables in a
data.table
is by reference, where we modify thedata.table
, we don’t create a newdata.table
.Use
:=
to modify a data.table by reference. You put the variable names (as a character vector) to the left of:=
. You put the new variables (as a list) to the right of:=
.## data.table way c("gain") := .(dep_delay - arr_delay)] flights[, flights ## equivalent tidyverse way %>% flights_tib mutate(gain = dep_delay - arr_delay) -> flights_tib flights_tib
Quickly remove a column by setting that variable to NULL.
## data.table way c("gain") := .(NULL)] flights[, flights ## equivalent tidyverse way %>% flights_tib select(-gain) -> flights_tib flights_tib
Add multiple variables by separating them with columns.
c("gain", "dist_km") := .(dep_delay - arr_delay, 1.61 * distance)] flights[, flights c("gain", "dist_km") := .(NULL, NULL)] flights[, flights
Exercise: Add a variable called
speed
that is the average air speed of the plane in miles per hour. Then remove this variable.
Group Summaries
You calculate summaries in the column slot. It’s best to use the list method.
## data.table way dd = mean(dep_delay))] flights[, .( ## equivalent tidyverse way %>% flights_tib summarize(dd = mean(dep_delay))
## data.table way dd = mean(dep_delay), ad = mean(arr_delay))] flights[, .( ## equivalent tidyverse way %>% flights_tib summarize(dd = mean(dep_delay), ad = mean(arr_delay))
Count the number of rows with
.N
.## data.table way flights[, .(.N)] ## equivalent tidyverse way %>% flights_tib count()
In data.table, you create grouped summaries by simultaneously grouping and calculating summaries in one line, not separately like in dplyr.
In data.table, you use the
by
argument to specify the grouping variable. It should also be a list.## data.table way dd = mean(dep_delay)) , by = .(origin)] flights[, .( ## equivalent tidyverse way %>% flights_tib group_by(origin) %>% summarize(dd = mean(dep_delay))
You can calculate more than one group summary at a time, or group by more than one variable.
## data.table way dd = mean(dep_delay), .N), by = .(origin, carrier)] flights[, .( ## equivalent tidyverse way %>% flights_tib group_by(origin, carrier) %>% summarize(dd = mean(dep_delay), N = n())
`summarise()` has grouped output by 'origin'. You can override using the `.groups` argument.
Exercise: Use data.table to calculate the median air time for each month.
Exercise: Use data.table to calculate the number of trips from each airport for the carrier code DL.
Exercise: Use data.table to calculate the mean departure delay for each origin in the months of January and February.
Recoding
Suppose you want to change the values of a variable. In the tidyverse, we used
recode()
to do this.In data.table, we filter the rows then mutate by reference.
Let’s substitute
24
inhour
to0
.sort(unique(flights$hour)) sort(unique(flights_tib$hour))
## data.table way == 24L, hour := 0L] flights[hour ## equivalent tidyverse way %>% flights_tib mutate(hour = recode(hour, `24` = 0L)) -> flights_tib
sort(unique(flights$hour)) sort(unique(flights_tib$hour))
Exercise: In the
origin
variable, change"JFK"
to"John F. Kennedy"
,"LGA"
to"LaGuardia"
, and"EWR"
to"Newark Liberty"
.
Gathering
Problem: One variable spread across multiple columns.
Column names are actually values of a variable
Recall
table4a
andtable4b
from the tidyr package<- as.data.table(tidyr::table4a) dt4a <- as.data.table(tidyr::table4b) dt4b dt4a
country 1999 2000 <char> <num> <num> 1: Afghanistan 745 2666 2: Brazil 37737 80488 3: China 212258 213766
dt4b
country 1999 2000 <char> <num> <num> 1: Afghanistan 1.999e+07 2.060e+07 2: Brazil 1.720e+08 1.745e+08 3: China 1.273e+09 1.280e+09
Solution:
melt()
:## data.table way melt(dt4a, id.vars = c("country"), measure.vars = c("1999", "2000"), variable.name = "year", value.name = "count") ## Equivalent tidyverse way ::table4a %>% tidyrgather(`1999`, `2000`, key = "year", value = "count") ## or ::table4a %>% tidyrpivot_longer(cols = c("1999", "2000"), names_to = "year", values_to = "count")
RDS visualization:
Exercise: gather the
monkeymem
data frame (available at https://data-science-master.github.io/lectures/data/tidy_exercise/monkeymem.csv). The cell values represent identification accuracy of some objects (in percent of 20 trials).
Spreading
Problem: One observation is spread across multiple rows.
One column contains variable names. One column contains values for the different variables.
Recall
table2
from the tidyr package<- as.data.table(tidyr::table2) dt2 dt2
country year type count <char> <num> <char> <num> 1: Afghanistan 1999 cases 7.450e+02 2: Afghanistan 1999 population 1.999e+07 3: Afghanistan 2000 cases 2.666e+03 4: Afghanistan 2000 population 2.060e+07 5: Brazil 1999 cases 3.774e+04 6: Brazil 1999 population 1.720e+08 7: Brazil 2000 cases 8.049e+04 8: Brazil 2000 population 1.745e+08 9: China 1999 cases 2.123e+05 10: China 1999 population 1.273e+09 11: China 2000 cases 2.138e+05 12: China 2000 population 1.280e+09
Solution:
dcast()
. In theformula
argument, put the “id variables” to the left and the “key” variables to the right. In tidyverse jargon, thevalue
is everything not stated in the formula and thekey
is everything to the left of the tilde.## data.table way dcast(dt2, formula = country + year ~ type, value.var = "count") ## Equivalent tidyverse way ::table2 %>% tidyrspread(key = type, value = count) ## or ::table2 %>% tidyrpivot_wider(id_cols = c("country", "year"), names_from = "type", values_from = "count")
RDS visualization:
Exercise: Spread the
flowers1
data frame (available at https://data-science-master.github.io/lectures/data/tidy_exercise/flowers1.csv).
Separating
To separate a column into two columns, use the base function
tstrsplit()
.<- as.data.table(tidyr::table3) dt3 dt3
country year rate <char> <num> <char> 1: Afghanistan 1999 745/19987071 2: Afghanistan 2000 2666/20595360 3: Brazil 1999 37737/172006362 4: Brazil 2000 80488/174504898 5: China 1999 212258/1272915272 6: China 2000 213766/1280428583
## data.table way c("cases", "population") := tstrsplit(rate, split = "/")] dt3[, := NULL] dt3[, rate dt3 ## equivalent tidyverse way ::table3 %>% tidyrseparate(rate, into = c("cases", "population"), sep = "/")
RDS visualization:
Exercise: Separate the
flowers2
data frame (available at https://data-science-master.github.io/lectures/data/tidy_exercise/flowers2.csv).
Uniting
To unite, use
paste()
.<- as.data.table(tidyr::table5) dt5 dt5
country century year rate <char> <char> <char> <char> 1: Afghanistan 19 99 745/19987071 2: Afghanistan 20 00 2666/20595360 3: Brazil 19 99 37737/172006362 4: Brazil 20 00 80488/174504898 5: China 19 99 212258/1272915272 6: China 20 00 213766/1280428583
## data.table way := paste(century, year, sep = "")] dt5[, year := NULL] dt5[, century dt5 ## Equivalent tidyverse way ::table5 %>% tidyrunite(century, year, col = "year", sep = "")
RDS visualization:
Exercise: Re-unite the data frame you separated from the flowers2 exercise. Use a comma for the separator.
Chaining
In the tidyverse, we chain commands by using the pipe
%>%
. In data.table, we chain commands by adding additional brackets after the brackets we used. Data.table makes this very efficient.Let’s calculate the mean arrival delay for american airlines for each origin/destination pair, then order the results by origin in increasing order, breaking ties by destination in decreasing order.
## data.table way == "AA", .(ad = mean(arr_delay)), by = .(origin, dest)][order(origin, -dest)] flights[carrier ## Usual indentation for readability: == "AA", .(ad = mean(arr_delay)), by = .(origin, dest) flights[carrier order(origin, -dest)] ][ ## Equivalent tidyverse way %>% flights_tib filter(carrier == "AA") %>% group_by(origin, dest) %>% summarize(ad = mean(arr_delay)) %>% arrange(origin, desc(dest))
Joining
We’ll use the following
data.table
s to introduce joining.<- data.table(mykey = c("1", "2", "3"), xdf x_val = c("x1", "x2", "x3")) <- data.table(mykey = c("1", "2", "4"), ydf y_val = c("y1", "y2", "y3")) xdf
mykey x_val <char> <char> 1: 1 x1 2: 2 x2 3: 3 x3
ydf
mykey y_val <char> <char> 1: 1 y1 2: 2 y2 3: 4 y3
Use the
merge()
function for all joining in data.table.Inner Join:
## data.table way merge(xdf, ydf, by = "mykey") ## equivalent tidyverse way inner_join(xdf, ydf, by = "mykey")
Outer Joins
Left Join
## data.table way merge(xdf, ydf, by = "mykey", all.x = TRUE) ## equivalent tidyverse way left_join(xdf, ydf, by = "mykey")
Right Join
## data.table way merge(xdf, ydf, by = "mykey", all.y = TRUE) ## equivalent tidyverse way right_join(xdf, ydf, by = "mykey")
Outer Join
## data.table way merge(xdf, ydf, by = "mykey", all.x = TRUE, all.y = TRUE) ## equivalent tidyverse way full_join(xdf, ydf, by = "mykey")
Binding Rows:
## data.table way rbind(xdf, ydf, fill = TRUE) ## equivalent tidyverse way bind_rows(xdf, ydf)
When you have different key names, use the
by.x
andby.y
arguments instead of theby
argument.names(ydf)[1] <- "newkey" ydf
newkey y_val <char> <char> 1: 1 y1 2: 2 y2 3: 4 y3
## data.table way merge(xdf, ydf, by.x = "mykey", by.y = "newkey") ## equivalent tidyverse way inner_join(xdf, ydf, by = c("mykey" = "newkey"))
Exercise: Recall the
nycflights13
datasetlibrary(nycflights13) data("flights") data("airlines") data("planes") <- as.data.table(flights) flights <- as.data.table(airlines) airlines <- as.data.table(planes) planes
Add the full airline names to the
flights
data.table.Exercise: Select all flights that use a plane where you have some annotation.
Additional References
There is so much more to data.table. You can do quite complicated and sophisticated data manipulations using data.table that would require a lot more typing in the tidyverse. I am not an expert in much of this functionality. A large number of articles and references are given on the homepage: https://rdatatable.gitlab.io/data.table/.