Reading Data - data.table & arrow

Last updated on Feb 22, 2021 5 min read

Hi, I recently stumbled upon the discussion on twitter, concerning the speed of reading data into R. A recent conference talk at rstudio::global, where it was claimed that arrow could outperform data.table´s fread function and the following debate on twitter got me interested in testing arrow´s data reading capabilities vs data.table´s. This is not a comprehensive benchmark, rather a test for data I used in one of my last project.

Data

We are going to download water level data from Hamburg, St.Pauli, Germany. You can access the data here: https://www.kuestendaten.de/DE/dynamisch/Funktionen/Liste_der_vorhandenen_Daten/index.php.html

You need to input a few parameters here:

Portal = Portal Tideelbe
Erhebende Organisation = Hamburg Port Authority (HPA)
Themenbereich = Alle
Einzelthema = Wasserstand

Then choose the data for St. Pauli (Wasserstand (Ganglinie)) The downloaded file is a zip file, which you can unzip using something like (using your path to the downloaded data):

unzip(zipfile = file.path("path"), exdir = file.path("path"))

You should get a txt file named directDownload_97.txt, which we are going to read into R using the functions data.table::fread and arrow::read_delim_arrow. To make things a little bit easier I deleted the header and the tail of the file by hand.

A first try

Now we load the packages into our R session and set the path to the file you created with unzip and cleaned by hand.

library(data.table)
library(arrow)
library(microbenchmark)

You need set the path to the unzipped and cleaned file.

path_to <- file.path("D:", "Projekte", "TestsPackages", "data", "directDownload_97.txt")

We define two functions, which are calling the arrow or data.table functions.

#Calling arrow
arrow_read <- function() {
  read_delim_arrow(file = path_to, delim = "\t", 
                   col_names = FALSE)
}
#Calling data.table
dt_read    <- function() {
  fread(file = path_to, sep = "\t", header = FALSE)
}

Using the microbenchmark package, we call the functions five times each and print/plot the results. Each function should be called 3 times as a block.

mb <- microbenchmark::microbenchmark(dt_read(), arrow_read(),  
                                     times = 3, control = list(order = "block"))

print(mb)

## Unit: milliseconds
##          expr      min       lq     mean    median        uq       max neval
##     dt_read() 862.1386 863.7842 867.7227  865.4299  870.5147  875.5995     3
##  arrow_read() 805.2859 922.7686 976.3603 1040.2512 1061.8974 1083.5437     3

So, arrow::read_delim_arrow is super fast here, comparable do data.table::fread. We take a look at the objects that are being created.

da <- arrow_read()
str(da)

## Classes 'tbl_df', 'tbl' and 'data.frame':    11353967 obs. of  3 variables:
##  $ f0: POSIXct, format: "1997-11-01 01:00:00" "1997-11-01 01:10:00" ...
##  $ f1: int  329 330 339 354 374 396 419 442 464 484 ...
##  $ f2: int  NA NA NA NA NA NA NA NA NA NA ...

dt <- dt_read()
str(dt)

## Classes 'data.table' and 'data.frame':   11353967 obs. of  3 variables:
##  $ V1: POSIXct, format: "1997-11-01 00:00:00" "1997-11-01 00:10:00" ...
##  $ V2: int  329 330 339 354 374 396 419 442 464 484 ...
##  $ V3: int  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>

arrow_read() creates a tibble with three columns (f0, f1, f2). With f0 being a POSIXct vector, while dt_read() returns a data.table, with columns V1, V2, V3.

Setting column classes

We overwrite the functions setting the column classes explicitly.

#Calling arrow
#setting the col_types parameter
arrow_read <- function(col_classes = c("tii") ) {
  read_delim_arrow(file = path_to, delim = "\t", col_types = col_classes, 
                   col_names = c("V1", "V2", "V3"))
}
#Calling data.table
#setting the colClasses parameter
dt_read    <- function(col_classes = c("POSIXct", "integer", "integer")) {
  fread(file = path_to, sep = "\t", header = FALSE, tz = "UTC",
                    colClasses = col_classes)
}

arrow and data.table automatically detect that the first column is a date/time object and reads it as POSIXct, whereas Let us assume for now that the first column should be read as a character vector. We can set the parameter of the functions as such:

a_c <- c("cii")
d_c <- c("character", "integer", "integer")

mb <- microbenchmark::microbenchmark(dt_read(col_classes = d_c),
                                     arrow_read(col_classes = a_c),
                                     times = 3, control = list(order = "block"))

The results of the benchmark is shown below:

## Unit: seconds
##                           expr      min       lq     mean   median        uq
##     dt_read(col_classes = d_c) 2.849376 6.334458 8.475382 9.819539 11.288384
##  arrow_read(col_classes = a_c) 4.176872 4.260962 4.440585 4.345051  4.572441
##        max neval
##  12.757229     3
##   4.799831     3

Threads

Since I am new to the arrow library, I do not know, how many threads it uses. We can check and set the number of threads for arrow and data.table using built-in functions.

arrow::cpu_count()

## [1] 4

data.table::getDTthreads()

## [1] 2

This makes we wonder what happens when we set the threads and compare the reader functions again.

SetThreads <- function(n_threads = 2L) {
  arrow::set_cpu_count(num_threads = n_threads)
  data.table::setDTthreads(threads = n_threads)
}

mb <- vector(mode = "list", length = 4L)

for(i in 1L:4L) {
  SetThreads(n_threads = i)

a_c <- c("cii")
d_c <- c("character", "integer", "integer")

mb[[i]] <- summary(microbenchmark::microbenchmark(dt_read(col_classes = d_c),
                                                  arrow_read(col_classes = a_c),
                                     times = 3, control = list(order = "block")))

}

mb <- rbindlist(mb, idcol = "threads")

print(mb)

##    threads                          expr      min       lq     mean   median
## 1:       1    dt_read(col_classes = d_c) 3.540304 3.692165 4.439417 3.844026
## 2:       1 arrow_read(col_classes = a_c) 4.198402 4.738828 5.937146 5.279254
## 3:       2    dt_read(col_classes = d_c) 2.824914 2.845855 4.406937 2.866795
## 4:       2 arrow_read(col_classes = a_c) 3.464902 3.890802 4.175166 4.316702
## 5:       3    dt_read(col_classes = d_c) 2.832681 2.854359 2.950912 2.876038
## 6:       3 arrow_read(col_classes = a_c) 3.479970 3.493036 4.243320 3.506102
## 7:       4    dt_read(col_classes = d_c) 2.386013 2.403941 4.573434 2.421869
## 8:       4 arrow_read(col_classes = a_c) 2.816352 2.825118 3.145613 2.833883
##          uq      max neval
## 1: 4.888974 5.933922     3
## 2: 6.806517 8.333781     3
## 3: 5.197948 7.529101     3
## 4: 4.530299 4.743895     3
## 5: 3.010027 3.144016     3
## 6: 4.624995 5.743888     3
## 7: 5.667145 8.912422     3
## 8: 3.310244 3.786605     3

Wrapping up

I use data.table on a daily basis for my scripts and frequently call fread to read data into my R session. arrow is super impressive and performance of both functions is amazing. Thanks to all package developers for enhancing our lifes :-). See you soon!

data.table arrow

Moritz Mueller-Navarra

A Data Scientist using R