Manipulating Large(ish) Datasets with data.table

Author

David Gerard

Published

June 3, 2025

Learning Objectives

Data Frames and Motivation

  • Big data: Can’t read data into memory. Need to do distributed computing (like Spark).

  • Biggish data: Can read data into memory, but it’s very slow to manipulate it.

  • The data.table package helps you with biggish data.

  • data.frames are a base R class. We’ve learned about them before.

  • There are many competitors to data.frames: tibble, DataFrame (from Bioconductor), and data.table.

  • Julia (another statistical programming language) has its own DataFrames class. Python’s data frame packages are pandas and polars.

  • data.table is among the fastest in most benchmarks: https://h2oai.github.io/db-benchmark/.

    • Python’s polars, and Julia’s DataFrames.jl come close and are faster in some benchmarks.
    • dplyr doesn’t come close.
  • So why did we learn the tidyverse?

    • The Posit folks are better at marketing, so dplyr is more widely used.
    • For datasets with fewer than, like, one million observations, dplyr works fast enough.

Syntax

Reading in and Printing Data

  • Load the data.table package and (to compare) the tidyverse into R:

    library(tidyverse)
    library(data.table)
  • We’ll demonstrate most methods with the data from “NYC-flights14” dataset from the data.table package vignettes. I’ve posted it on my own webpage: https://data-science-master.github.io/lectures/data/flights14.csv.

  • Read in data with fread() (for “file read”). It accepts all delimiters, which you (optionally) specify with the sep argument. The default is to guess the delimiter.

    flights <- fread("../data/flights14.csv")
  • fread() will return a data.table object.

    class(flights)
    [1] "data.table" "data.frame"
  • Use fwrite() to write a data.table object to a file.

  • Compare to read_csv() in the tidyverse.

    flights_tib <- read_csv("../data/flights14.csv")
  • read_csv() will return a tibble.

    class(flights_tib)
    [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
  • tibbles and data.tables both print better than data.frames.

    flights
    flights_tib
  • You usually use the base function str() (for “structure”) to look at the data.table entries.

    ## data.table way
    str(flights)
    
    ## Similar tidyverse way
    glimpse(flights_tib)
  • You can use as.data.table() to convert a tibble or a data.frame into a data.table. But there is rarely a time when you’d do this, since you use data.table for large datasets that you read in with fread().

    temp_dt <- as.data.table(flights_tib)
    class(temp_dt)
    [1] "data.table" "data.frame"

Filtering/Arranging Rows (Observations)

  • Just like in the tidyverse, we use logicals to filter based on rows. The syntax for this is to place the logicals inside a bracket. Let’s find all flights that left JFK and arrived at LAX.

    ## data.table way
    flights[origin == "JFK" & dest == "LAX"]
    
    ## equivalent tidyverse way
    flights_tib %>%
      filter(origin == "JFK", dest == "LAX")
  • To get a specific row, insert a number into the brackets.

    ## data.table way
    flights[c(1, 3, 207)]
    
    ## equivalent tidyverse way
    flights_tib %>%
      slice(1, 3, 207)
  • Reorder rows by using the order() function inside the brackets. Let’s reorder the rows alphabetically by origin, and break ties in reverse alphabetical order by destination.

    ## data.table way
    flights[order(origin, -dest)]
    
    ## equivalent tidyverse way    
    flights_tib %>%
      arrange(origin, desc(dest))
  • Exercise: Use both data.table and the tidyverse to select all flights from EWR and LGA, and arrange the flights in decreasing order of departure delay.

Selecting Columns (Variables)

  • To select a variable (a column), you also use bracket notation, but you place a comma before you select the columns. This idea is that you are selecting all rows (empty space before comma).

  • There are lots of ways to select columns that keeps the new object a data.table.

  • List method: Use the .() function (which is an alias for list()).

    ## data.table way
    flights[, .(origin, dest)]
    
    ## Or
    flights[, list(origin, dest)]
    
    ## equivalent tidyverse way
    flights_tib %>%
      select(origin, dest)
  • Character Vector Method: Use c() with their character names

    flights[, c("origin", "dest")]
  • Range Method: Use : to select variables within a range of columns.

    flights[, origin:dest]
  • Prespecify Method: Define variables to keep outside of the data.table, then use with = FALSE. This option makes data.table not think that keep_vec is a varible in the data.table.

    keep_vec <- c("origin", "dest")
    flights[, keep_vec, with = FALSE]
  • To remove a column using the range or character methods, place a ! before the columns to remove

    flights[, !c("year", "month")]
    flights[, !(year:month)]
  • To remove a column using the list method, assign that variable to be NULL using modify by reference (see below). If you run the below code, you’ll need to reload the flights data to get back year and month.

    flights[, c("year", "month") := .(NULL, NULL)]
  • Exercise: Use data.table to select the year, month, day, and hour columns.

  • Unlike the tidyverse, you filter rows and select columns in one call rather than using two separate functions.

    ## data.table way
    flights[origin == "JFK" & dest == "LAX", .(year, month, day, hour)]
    
    ## equivalent tidyverse way
    flights_tib %>%
      filter(origin == "JFK", dest == "LAX") %>%
      select(year, month, day, hour)
  • Exercise: Use data.table to print out just the departure delays from JFK.

Creating New Variables (Mutate)

  • The fastest way to create and remove variables in a data.table is by reference, where we modify the data.table, we don’t create a new data.table.

  • Use := to modify a data.table by reference. You put the variable names (as a character vector) to the left of :=. You put the new variables (as a list) to the right of :=.

    ## data.table way
    flights[, c("gain") := .(dep_delay - arr_delay)]
    flights
    
    ## equivalent tidyverse way    
    flights_tib %>%
      mutate(gain = dep_delay - arr_delay) ->
      flights_tib
    flights_tib
  • Quickly remove a column by setting that variable to NULL.

    ## data.table way
    flights[, c("gain") := .(NULL)]
    flights
    
    ## equivalent tidyverse way
    flights_tib %>%
      select(-gain) ->
      flights_tib
    flights_tib
  • Add multiple variables by separating them with columns.

    flights[, c("gain", "dist_km") := .(dep_delay - arr_delay, 1.61 * distance)]
    flights
    
    flights[, c("gain", "dist_km") := .(NULL, NULL)]
    flights
  • Exercise: Add a variable called speed that is the average air speed of the plane in miles per hour. Then remove this variable.

Group Summaries

  • You calculate summaries in the column slot. It’s best to use the list method.

    ## data.table way
    flights[, .(dd = mean(dep_delay))]
    
    ## equivalent tidyverse way
    flights_tib %>%
      summarize(dd = mean(dep_delay))
    ## data.table way
    flights[, .(dd = mean(dep_delay), ad = mean(arr_delay))]
    
    ## equivalent tidyverse way
    flights_tib %>%
      summarize(dd = mean(dep_delay), ad = mean(arr_delay))
  • Count the number of rows with .N.

    ## data.table way
    flights[, .(.N)]
    
    ## equivalent tidyverse way
    flights_tib %>%
      count()
  • In data.table, you create grouped summaries by simultaneously grouping and calculating summaries in one line, not separately like in dplyr.

  • In data.table, you use the by argument to specify the grouping variable. It should also be a list.

    ## data.table way
    flights[, .(dd = mean(dep_delay)) , by = .(origin)]
    
    ## equivalent tidyverse way
    flights_tib %>%
      group_by(origin) %>%
      summarize(dd = mean(dep_delay))
  • You can calculate more than one group summary at a time, or group by more than one variable.

    ## data.table way
    flights[, .(dd = mean(dep_delay), .N), by = .(origin, carrier)]
    
    ## equivalent tidyverse way
    flights_tib %>%
      group_by(origin, carrier) %>%
      summarize(dd = mean(dep_delay), N = n())
    `summarise()` has grouped output by 'origin'. You can override using the
    `.groups` argument.
  • Exercise: Use data.table to calculate the median air time for each month.

  • Exercise: Use data.table to calculate the number of trips from each airport for the carrier code DL.

  • Exercise: Use data.table to calculate the mean departure delay for each origin in the months of January and February.

Recoding

  • Suppose you want to change the values of a variable. In the tidyverse, we used recode() to do this.

  • In data.table, we filter the rows then mutate by reference.

  • Let’s substitute 24 in hour to 0.

    sort(unique(flights$hour))
    sort(unique(flights_tib$hour))
    ## data.table way
    flights[hour == 24L, hour := 0L]
    
    ## equivalent tidyverse way
    flights_tib %>%
      mutate(hour = recode(hour, `24` = 0L)) ->
      flights_tib
    sort(unique(flights$hour))
    sort(unique(flights_tib$hour))
  • Exercise: In the origin variable, change "JFK" to "John F. Kennedy", "LGA" to "LaGuardia", and "EWR" to "Newark Liberty".

Gathering

  • Problem: One variable spread across multiple columns.

  • Column names are actually values of a variable

  • Recall table4a and table4b from the tidyr package

    dt4a <- as.data.table(tidyr::table4a)
    dt4b <- as.data.table(tidyr::table4b)
    dt4a
           country   1999   2000
            <char>  <num>  <num>
    1: Afghanistan    745   2666
    2:      Brazil  37737  80488
    3:       China 212258 213766
    dt4b
           country      1999      2000
            <char>     <num>     <num>
    1: Afghanistan 1.999e+07 2.060e+07
    2:      Brazil 1.720e+08 1.745e+08
    3:       China 1.273e+09 1.280e+09
  • Solution: melt():

    ## data.table way
    melt(dt4a, 
         id.vars       = c("country"),
         measure.vars  = c("1999", "2000"),
         variable.name = "year",
         value.name    = "count")
    
    ## Equivalent tidyverse way
    tidyr::table4a %>%
      gather(`1999`, `2000`, key = "year", value = "count")
    
    ## or
    tidyr::table4a %>%
      pivot_longer(cols = c("1999", "2000"), 
                   names_to = "year", 
                   values_to = "count")
  • RDS visualization:


  • Exercise: gather the monkeymem data frame (available at https://data-science-master.github.io/lectures/data/tidy_exercise/monkeymem.csv). The cell values represent identification accuracy of some objects (in percent of 20 trials).

Spreading

  • Problem: One observation is spread across multiple rows.

  • One column contains variable names. One column contains values for the different variables.

  • Recall table2 from the tidyr package

    dt2 <- as.data.table(tidyr::table2)
    dt2
            country  year       type     count
             <char> <num>     <char>     <num>
     1: Afghanistan  1999      cases 7.450e+02
     2: Afghanistan  1999 population 1.999e+07
     3: Afghanistan  2000      cases 2.666e+03
     4: Afghanistan  2000 population 2.060e+07
     5:      Brazil  1999      cases 3.774e+04
     6:      Brazil  1999 population 1.720e+08
     7:      Brazil  2000      cases 8.049e+04
     8:      Brazil  2000 population 1.745e+08
     9:       China  1999      cases 2.123e+05
    10:       China  1999 population 1.273e+09
    11:       China  2000      cases 2.138e+05
    12:       China  2000 population 1.280e+09
  • Solution: dcast(). In the formula argument, put the “id variables” to the left and the “key” variables to the right. In tidyverse jargon, the value is everything not stated in the formula and the key is everything to the left of the tilde.

    ## data.table way
    dcast(dt2, formula = country + year ~ type, value.var = "count")
    
    ## Equivalent tidyverse way
    tidyr::table2 %>%
      spread(key = type, value = count)
    
    ## or
    tidyr::table2 %>%
      pivot_wider(id_cols = c("country", "year"), 
                  names_from = "type", 
                  values_from = "count")
  • RDS visualization:

     

  • Exercise: Spread the flowers1 data frame (available at https://data-science-master.github.io/lectures/data/tidy_exercise/flowers1.csv).

Separating

  • To separate a column into two columns, use the base function tstrsplit().

    dt3 <- as.data.table(tidyr::table3)
    dt3
           country  year              rate
            <char> <num>            <char>
    1: Afghanistan  1999      745/19987071
    2: Afghanistan  2000     2666/20595360
    3:      Brazil  1999   37737/172006362
    4:      Brazil  2000   80488/174504898
    5:       China  1999 212258/1272915272
    6:       China  2000 213766/1280428583
    ## data.table way
    dt3[, c("cases", "population") := tstrsplit(rate, split = "/")]
    dt3[, rate := NULL]
    dt3
    
    ## equivalent tidyverse way
    tidyr::table3 %>%
      separate(rate, into = c("cases", "population"), sep = "/")
  • RDS visualization:

     

  • Exercise: Separate the flowers2 data frame (available at https://data-science-master.github.io/lectures/data/tidy_exercise/flowers2.csv).

Uniting

  • To unite, use paste().

    dt5 <- as.data.table(tidyr::table5)
    dt5
           country century   year              rate
            <char>  <char> <char>            <char>
    1: Afghanistan      19     99      745/19987071
    2: Afghanistan      20     00     2666/20595360
    3:      Brazil      19     99   37737/172006362
    4:      Brazil      20     00   80488/174504898
    5:       China      19     99 212258/1272915272
    6:       China      20     00 213766/1280428583
    ## data.table way
    dt5[, year := paste(century, year, sep = "")]
    dt5[, century := NULL]
    dt5
    
    ## Equivalent tidyverse way
    tidyr::table5 %>%
      unite(century, year, col = "year", sep = "")
  • RDS visualization:


  • Exercise: Re-unite the data frame you separated from the flowers2 exercise. Use a comma for the separator.

Chaining

  • In the tidyverse, we chain commands by using the pipe %>%. In data.table, we chain commands by adding additional brackets after the brackets we used. Data.table makes this very efficient.

  • Let’s calculate the mean arrival delay for american airlines for each origin/destination pair, then order the results by origin in increasing order, breaking ties by destination in decreasing order.

    ## data.table way
    flights[carrier == "AA", .(ad = mean(arr_delay)), by = .(origin, dest)][order(origin, -dest)]
    
    ## Usual indentation for readability:
    flights[carrier == "AA", .(ad = mean(arr_delay)), by = .(origin, dest)
            ][order(origin, -dest)]
    
    ## Equivalent tidyverse way
    flights_tib %>%
      filter(carrier == "AA") %>%
      group_by(origin, dest) %>%
      summarize(ad = mean(arr_delay)) %>%
      arrange(origin, desc(dest))

Joining

  • We’ll use the following data.tables to introduce joining.

    xdf <- data.table(mykey = c("1", "2", "3"),
                      x_val = c("x1", "x2", "x3"))
    ydf <- data.table(mykey = c("1", "2", "4"),
                      y_val = c("y1", "y2", "y3"))
    xdf
        mykey  x_val
       <char> <char>
    1:      1     x1
    2:      2     x2
    3:      3     x3
    ydf
        mykey  y_val
       <char> <char>
    1:      1     y1
    2:      2     y2
    3:      4     y3

     

  • Use the merge() function for all joining in data.table.

  • Inner Join:

     

    ## data.table way
    merge(xdf, ydf, by = "mykey")
    
    ## equivalent tidyverse way
    inner_join(xdf, ydf, by = "mykey")
  • Outer Joins

     

  • Left Join

    ## data.table way
    merge(xdf, ydf, by = "mykey", all.x = TRUE)
    
    ## equivalent tidyverse way
    left_join(xdf, ydf, by = "mykey")
  • Right Join

    ## data.table way
    merge(xdf, ydf, by = "mykey", all.y = TRUE)
    
    ## equivalent tidyverse way
    right_join(xdf, ydf, by = "mykey")
  • Outer Join

    ## data.table way
    merge(xdf, ydf, by = "mykey", all.x = TRUE, all.y = TRUE)
    
    ## equivalent tidyverse way
    full_join(xdf, ydf, by = "mykey")
  • Binding Rows:

    ## data.table way
    rbind(xdf, ydf, fill = TRUE)
    
    ## equivalent tidyverse way
    bind_rows(xdf, ydf)
  • When you have different key names, use the by.x and by.y arguments instead of the by argument.

    names(ydf)[1] <- "newkey"
    ydf
       newkey  y_val
       <char> <char>
    1:      1     y1
    2:      2     y2
    3:      4     y3
    ## data.table way
    merge(xdf, ydf, by.x = "mykey", by.y = "newkey")
    
    ## equivalent tidyverse way
    inner_join(xdf, ydf, by = c("mykey" = "newkey"))
  • Exercise: Recall the nycflights13 dataset

    library(nycflights13)
    data("flights")
    data("airlines")
    data("planes")
    flights  <- as.data.table(flights)
    airlines <- as.data.table(airlines)
    planes   <- as.data.table(planes)

    Add the full airline names to the flights data.table.

  • Exercise: Select all flights that use a plane where you have some annotation.

Additional References

There is so much more to data.table. You can do quite complicated and sophisticated data manipulations using data.table that would require a lot more typing in the tidyverse. I am not an expert in much of this functionality. A large number of articles and references are given on the homepage: https://rdatatable.gitlab.io/data.table/.