data.table
syntax for manipulating data frames.Big data: Can’t read data into memory. Need to do distributed computing (like Spark).
Biggish data: Can read data into memory, but it’s very slow to manipulate it.
The data.table package helps you with biggish data.
data.frame
s are a base R class. We’ve learned about
them before.
There are many competitors to data.frame
s: tibble
, DataFrame
(from Bioconductor), and data.table
.
Julia (another statistical programming language) has its own DataFrames
class. Python’s data frame packages are pandas
and polars
.
data.table
is among the fastest in most benchmarks:
https://h2oai.github.io/db-benchmark/.
polars
, and Julia’s DataFrames.jl
come close and are faster in some benchmarks.dplyr
doesn’t come close.So why did we learn the tidyverse?
dplyr
is
more widely used.Load the data.table package and (to compare) the tidyverse into R:
library(tidyverse)
library(data.table)
We’ll demonstrate most methods with the data from “NYC-flights14” dataset from the data.table package vignettes. I’ve posted it on my own webpage: https://data-science-master.github.io/lectures/data/flights14.csv.
Read in data with fread()
(for “file read”). It
accepts all delimiters, which you (optionally) specify with the
sep
argument. The default is to guess the delimiter.
flights <- fread("../data/flights14.csv")
fread()
will return a data.table
object.
class(flights)
## [1] "data.table" "data.frame"
Use fwrite()
to write a data.table
object to a file.
Compare to read_csv()
in the tidyverse.
flights_tib <- read_csv("../data/flights14.csv")
read_csv()
will return a tibble.
class(flights_tib)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
tibble
s and data.table
s both print
better than data.frame
s.
flights
flights_tib
You usually use the base function str()
(for
“structure”) to look at the data.table
entries.
## data.table way
str(flights)
## Similar tidyverse way
glimpse(flights_tib)
You can use as.data.table()
to convert a
tibble
or a data.frame
into a
data.table
. But there is rarely a time when you’d do this,
since you use data.table for large datasets that you read in with
fread()
.
temp_dt <- as.data.table(flights_tib)
class(temp_dt)
## [1] "data.table" "data.frame"
Just like in the tidyverse, we use logicals to filter based on rows. The syntax for this is to place the logicals inside a bracket. Let’s find all flights that left JFK and arrived at LAX.
## data.table way
flights[origin == "JFK" & dest == "LAX"]
## equivalent tidyverse way
flights_tib %>%
filter(origin == "JFK", dest == "LAX")
To get a specific row, insert a number into the brackets.
## data.table way
flights[c(1, 3, 207)]
## equivalent tidyverse way
flights_tib %>%
slice(1, 3, 207)
Reorder rows by using the order()
function inside
the brackets. Let’s reorder the rows alphabetically by origin, and break
ties in reverse alphabetical order by destination.
## data.table way
flights[order(origin, -dest)]
## equivalent tidyverse way
flights_tib %>%
arrange(origin, desc(dest))
Exercise: Use both data.table and the tidyverse to select all flights from EWR and LGA, and arrange the flights in decreasing order of departure delay.
To select a variable (a column), you also use bracket notation, but you place a comma before you select the columns. This idea is that you are selecting all rows (empty space before comma).
There are lots of ways to select columns that keeps the new
object a data.table
.
List method: Use the .()
function
(which is an alias for list()
).
## data.table way
flights[, .(origin, dest)]
## Or
flights[, list(origin, dest)]
## equivalent tidyverse way
flights_tib %>%
select(origin, dest)
Character Vector Method: Use c()
with their character names
flights[, c("origin", "dest")]
Range Method: Use :
to select
variables within a range of columns.
flights[, origin:dest]
Prespecify Method: Define variables to keep
outside of the data.table
, then use
with = FALSE
. This option makes data.table not think that
keep_vec
is a varible in the data.table
.
keep_vec <- c("origin", "dest")
flights[, keep_vec, with = FALSE]
To remove a column using the range or character methods, place a
!
before the columns to remove
flights[, !c("year", "month")]
flights[, !(year:month)]
To remove a column using the list method, assign that variable to
be NULL
using modify by reference (see below). If you run
the below code, you’ll need to reload the flights
data to
get back year
and month
.
flights[, c("year", "month") := .(NULL, NULL)]
Exercise: Use data.table to select the
year
, month
, day
, and
hour
columns.
Unlike the tidyverse, you filter rows and select columns in one call rather than using two separate functions.
## data.table way
flights[origin == "JFK" & dest == "LAX", .(year, month, day, hour)]
## equivalent tidyverse way
flights_tib %>%
filter(origin == "JFK", dest == "LAX") %>%
select(year, month, day, hour)
Exercise: Use data.table to print out just the departure delays from JFK.
The fastest way to create and remove variables
in a data.table
is by reference,
where we modify the data.table
, we don’t
create a new data.table
.
Use :=
to modify a data.table by reference. You put
the variable names (as a character vector) to the left of
:=
. You put the new variables (as a list) to the right of
:=
.
## data.table way
flights[, c("gain") := .(dep_delay - arr_delay)]
flights
## equivalent tidyverse way
flights_tib %>%
mutate(gain = dep_delay - arr_delay) ->
flights_tib
flights_tib
Quickly remove a column by setting that variable to NULL.
## data.table way
flights[, c("gain") := .(NULL)]
flights
## equivalent tidyverse way
flights_tib %>%
select(-gain) ->
flights_tib
flights_tib
Add multiple variables by separating them with columns.
flights[, c("gain", "dist_km") := .(dep_delay - arr_delay, 1.61 * distance)]
flights
flights[, c("gain", "dist_km") := .(NULL, NULL)]
flights
Exercise: Add a variable called
speed
that is the average air speed of the plane in miles
per hour. Then remove this variable.
You calculate summaries in the column slot. It’s best to use the list method.
## data.table way
flights[, .(dd = mean(dep_delay))]
## equivalent tidyverse way
flights_tib %>%
summarize(dd = mean(dep_delay))
## data.table way
flights[, .(dd = mean(dep_delay), ad = mean(arr_delay))]
## equivalent tidyverse way
flights_tib %>%
summarize(dd = mean(dep_delay), ad = mean(arr_delay))
Count the number of rows with .N
.
## data.table way
flights[, .(.N)]
## equivalent tidyverse way
flights_tib %>%
count()
In data.table, you create grouped summaries by simultaneously grouping and calculating summaries in one line, not separately like in dplyr.
In data.table, you use the by
argument to specify
the grouping variable. It should also be a list.
## data.table way
flights[, .(dd = mean(dep_delay)) , by = .(origin)]
## equivalent tidyverse way
flights_tib %>%
group_by(origin) %>%
summarize(dd = mean(dep_delay))
You can calculate more than one group summary at a time, or group by more than one variable.
## data.table way
flights[, .(dd = mean(dep_delay), .N), by = .(origin, carrier)]
## equivalent tidyverse way
flights_tib %>%
group_by(origin, carrier) %>%
summarize(dd = mean(dep_delay), N = n())
## `summarise()` has grouped output by 'origin'. You can override using the
## `.groups` argument.
Exercise: Use data.table to calculate the median air time for each month.
Exercise: Use data.table to calculate the number of trips from each airport for the carrier code DL.
Exercise: Use data.table to calculate the mean departure delay for each origin in the months of January and February.
Suppose you want to change the values of a variable. In the
tidyverse, we used recode()
to do this.
In data.table, we filter the rows then mutate by reference.
Let’s substitute 24
in hour
to
0
.
sort(unique(flights$hour))
sort(unique(flights_tib$hour))
## data.table way
flights[hour == 24L, hour := 0L]
## equivalent tidyverse way
flights_tib %>%
mutate(hour = recode(hour, `24` = 0L)) ->
flights_tib
sort(unique(flights$hour))
sort(unique(flights_tib$hour))
Exercise: In the origin
variable,
change "JFK"
to "John F. Kennedy"
,
"LGA"
to "LaGuardia"
, and "EWR"
to "Newark Liberty"
.
Problem: One variable spread across multiple columns.
Column names are actually values of a variable
Recall table4a
and table4b
from the
tidyr package
dt4a <- as.data.table(tidyr::table4a)
dt4b <- as.data.table(tidyr::table4b)
dt4a
## country 1999 2000
## 1: Afghanistan 745 2666
## 2: Brazil 37737 80488
## 3: China 212258 213766
dt4b
## country 1999 2000
## 1: Afghanistan 1.999e+07 2.060e+07
## 2: Brazil 1.720e+08 1.745e+08
## 3: China 1.273e+09 1.280e+09
Solution: melt()
:
## data.table way
melt(dt4a,
id.vars = c("country"),
measure.vars = c("1999", "2000"),
variable.name = "year",
value.name = "count")
## Equivalent tidyverse way
tidyr::table4a %>%
gather(`1999`, `2000`, key = "year", value = "count")
## or
tidyr::table4a %>%
pivot_longer(cols = c("1999", "2000"),
names_to = "year",
values_to = "count")
RDS visualization:
Exercise: gather the monkeymem
data
frame (available at https://data-science-master.github.io/lectures/data/tidy_exercise/monkeymem.csv).
The cell values represent identification accuracy of some objects (in
percent of 20 trials).
Problem: One observation is spread across multiple rows.
One column contains variable names. One column contains values for the different variables.
Recall table2
from the tidyr package
dt2 <- as.data.table(tidyr::table2)
dt2
## country year type count
## 1: Afghanistan 1999 cases 7.450e+02
## 2: Afghanistan 1999 population 1.999e+07
## 3: Afghanistan 2000 cases 2.666e+03
## 4: Afghanistan 2000 population 2.060e+07
## 5: Brazil 1999 cases 3.774e+04
## 6: Brazil 1999 population 1.720e+08
## 7: Brazil 2000 cases 8.049e+04
## 8: Brazil 2000 population 1.745e+08
## 9: China 1999 cases 2.123e+05
## 10: China 1999 population 1.273e+09
## 11: China 2000 cases 2.138e+05
## 12: China 2000 population 1.280e+09
Solution: dcast()
. In the formula
argument, put the “id variables” to the left and the “key” variables to
the right. In tidyverse jargon, the value
is everything not
stated in the formula and the key
is everything to the left
of the tilde.
## data.table way
dcast(dt2, formula = country + year ~ type, value.var = "count")
## Equivalent tidyverse way
tidyr::table2 %>%
spread(key = type, value = count)
## or
tidyr::table2 %>%
pivot_wider(id_cols = c("country", "year"),
names_from = "type",
values_from = "count")
RDS visualization:
Exercise: Spread the flowers1
data
frame (available at https://data-science-master.github.io/lectures/data/tidy_exercise/flowers1.csv).
To separate a column into two columns, use the base function
tstrsplit()
.
dt3 <- as.data.table(tidyr::table3)
dt3
## country year rate
## 1: Afghanistan 1999 745/19987071
## 2: Afghanistan 2000 2666/20595360
## 3: Brazil 1999 37737/172006362
## 4: Brazil 2000 80488/174504898
## 5: China 1999 212258/1272915272
## 6: China 2000 213766/1280428583
## data.table way
dt3[, c("cases", "population") := tstrsplit(rate, split = "/")]
dt3[, rate := NULL]
dt3
## equivalent tidyverse way
tidyr::table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
RDS visualization:
Exercise: Separate the flowers2
data frame (available at https://data-science-master.github.io/lectures/data/tidy_exercise/flowers2.csv).
To unite, use paste()
.
dt5 <- as.data.table(tidyr::table5)
dt5
## country century year rate
## 1: Afghanistan 19 99 745/19987071
## 2: Afghanistan 20 00 2666/20595360
## 3: Brazil 19 99 37737/172006362
## 4: Brazil 20 00 80488/174504898
## 5: China 19 99 212258/1272915272
## 6: China 20 00 213766/1280428583
## data.table way
dt5[, year := paste(century, year, sep = "")]
dt5[, century := NULL]
dt5
## Equivalent tidyverse way
tidyr::table5 %>%
unite(century, year, col = "year", sep = "")
RDS visualization:
Exercise: Re-unite the data frame you separated from the flowers2 exercise. Use a comma for the separator.
In the tidyverse, we chain commands by using the pipe
%>%
. In data.table, we chain commands by adding
additional brackets after the brackets we used. Data.table makes this
very efficient.
Let’s calculate the mean arrival delay for american airlines for each origin/destination pair, then order the results by origin in increasing order, breaking ties by destination in decreasing order.
## data.table way
flights[carrier == "AA", .(ad = mean(arr_delay)), by = .(origin, dest)][order(origin, -dest)]
## Usual indentation for readability:
flights[carrier == "AA", .(ad = mean(arr_delay)), by = .(origin, dest)
][order(origin, -dest)]
## Equivalent tidyverse way
flights_tib %>%
filter(carrier == "AA") %>%
group_by(origin, dest) %>%
summarize(ad = mean(arr_delay)) %>%
arrange(origin, desc(dest))
We’ll use the following data.table
s to introduce
joining.
xdf <- data.table(mykey = c("1", "2", "3"),
x_val = c("x1", "x2", "x3"))
ydf <- data.table(mykey = c("1", "2", "4"),
y_val = c("y1", "y2", "y3"))
xdf
## mykey x_val
## 1: 1 x1
## 2: 2 x2
## 3: 3 x3
ydf
## mykey y_val
## 1: 1 y1
## 2: 2 y2
## 3: 4 y3
Use the merge()
function for all joining in
data.table.
Inner Join:
## data.table way
merge(xdf, ydf, by = "mykey")
## equivalent tidyverse way
inner_join(xdf, ydf, by = "mykey")
Outer Joins
Left Join
## data.table way
merge(xdf, ydf, by = "mykey", all.x = TRUE)
## equivalent tidyverse way
left_join(xdf, ydf, by = "mykey")
Right Join
## data.table way
merge(xdf, ydf, by = "mykey", all.y = TRUE)
## equivalent tidyverse way
right_join(xdf, ydf, by = "mykey")
Outer Join
## data.table way
merge(xdf, ydf, by = "mykey", all.x = TRUE, all.y = TRUE)
## equivalent tidyverse way
full_join(xdf, ydf, by = "mykey")
Binding Rows:
## data.table way
rbind(xdf, ydf, fill = TRUE)
## equivalent tidyverse way
bind_rows(xdf, ydf)
When you have different key names, use the by.x
and
by.y
arguments instead of the by
argument.
names(ydf)[1] <- "newkey"
ydf
## newkey y_val
## 1: 1 y1
## 2: 2 y2
## 3: 4 y3
## data.table way
merge(xdf, ydf, by.x = "mykey", by.y = "newkey")
## equivalent tidyverse way
inner_join(xdf, ydf, by = c("mykey" = "newkey"))
Exercise: Recall the nycflights13
dataset
library(nycflights13)
data("flights")
data("airlines")
data("planes")
flights <- as.data.table(flights)
airlines <- as.data.table(airlines)
planes <- as.data.table(planes)
Add the full airline names to the flights
data.table.
Exercise: Select all flights that use a plane where you have some annotation.
There is so much more to data.table. You can do quite complicated and sophisticated data manipulations using data.table that would require a lot more typing in the tidyverse. I am not an expert in much of this functionality. A large number of articles and references are given on the homepage: https://rdatatable.gitlab.io/data.table/.