Hi,
I recently stumbled upon the discussion on twitter, concerning the speed of reading data into R. A recent conference talk at rstudio::global, where it was claimed that arrow
could outperform data.table´s fread
function and the following debate on twitter got me interested in testing arrow´s
data reading capabilities vs data.table´s
. This is not a comprehensive benchmark, rather a test for data I used in one of my last project.
Data
We are going to download water level data from Hamburg, St.Pauli, Germany. You can access the data here: https://www.kuestendaten.de/DE/dynamisch/Funktionen/Liste_der_vorhandenen_Daten/index.php.html
You need to input a few parameters here:
- Portal = Portal Tideelbe
- Erhebende Organisation = Hamburg Port Authority (HPA)
- Themenbereich = Alle
- Einzelthema = Wasserstand
Then choose the data for St. Pauli (Wasserstand (Ganglinie)) The downloaded file is a zip file, which you can unzip using something like (using your path to the downloaded data):
unzip(zipfile = file.path("path"), exdir = file.path("path"))
You should get a txt file named directDownload_97.txt, which we are going to read into R using the functions data.table::fread
and arrow::read_delim_arrow
. To make things a little bit easier I deleted the header and the tail of the file by hand.
A first try
Now we load the packages into our R session and set the path to the file you created with unzip
and cleaned by hand.
library(data.table)
library(arrow)
library(microbenchmark)
You need set the path to the unzipped and cleaned file.
path_to <- file.path("D:", "Projekte", "TestsPackages", "data", "directDownload_97.txt")
We define two functions, which are calling the arrow
or data.table
functions.
#Calling arrow
arrow_read <- function() {
read_delim_arrow(file = path_to, delim = "\t",
col_names = FALSE)
}
#Calling data.table
dt_read <- function() {
fread(file = path_to, sep = "\t", header = FALSE)
}
Using the microbenchmark
package, we call the functions five times each and print/plot the results. Each function should be called 3 times as a block.
mb <- microbenchmark::microbenchmark(dt_read(), arrow_read(),
times = 3, control = list(order = "block"))
print(mb)
## Unit: milliseconds
## expr min lq mean median uq max neval
## dt_read() 862.1386 863.7842 867.7227 865.4299 870.5147 875.5995 3
## arrow_read() 805.2859 922.7686 976.3603 1040.2512 1061.8974 1083.5437 3
So, arrow::read_delim_arrow
is super fast here, comparable do data.table::fread
. We take a look at the objects that are being created.
da <- arrow_read()
str(da)
## Classes 'tbl_df', 'tbl' and 'data.frame': 11353967 obs. of 3 variables:
## $ f0: POSIXct, format: "1997-11-01 01:00:00" "1997-11-01 01:10:00" ...
## $ f1: int 329 330 339 354 374 396 419 442 464 484 ...
## $ f2: int NA NA NA NA NA NA NA NA NA NA ...
dt <- dt_read()
str(dt)
## Classes 'data.table' and 'data.frame': 11353967 obs. of 3 variables:
## $ V1: POSIXct, format: "1997-11-01 00:00:00" "1997-11-01 00:10:00" ...
## $ V2: int 329 330 339 354 374 396 419 442 464 484 ...
## $ V3: int NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
arrow_read()
creates a tibble
with three columns (f0, f1, f2). With f0 being a POSIXct
vector, while dt_read()
returns a data.table
, with columns V1, V2, V3.
Setting column classes
We overwrite the functions setting the column classes explicitly.
#Calling arrow
#setting the col_types parameter
arrow_read <- function(col_classes = c("tii") ) {
read_delim_arrow(file = path_to, delim = "\t", col_types = col_classes,
col_names = c("V1", "V2", "V3"))
}
#Calling data.table
#setting the colClasses parameter
dt_read <- function(col_classes = c("POSIXct", "integer", "integer")) {
fread(file = path_to, sep = "\t", header = FALSE, tz = "UTC",
colClasses = col_classes)
}
arrow
and data.table
automatically detect that the first column is a date/time object and reads it as POSIXct
, whereas Let us assume for now that the first column should be read as a character vector. We can set the parameter of the functions as such:
a_c <- c("cii")
d_c <- c("character", "integer", "integer")
mb <- microbenchmark::microbenchmark(dt_read(col_classes = d_c),
arrow_read(col_classes = a_c),
times = 3, control = list(order = "block"))
The results of the benchmark is shown below:
## Unit: seconds
## expr min lq mean median uq
## dt_read(col_classes = d_c) 2.849376 6.334458 8.475382 9.819539 11.288384
## arrow_read(col_classes = a_c) 4.176872 4.260962 4.440585 4.345051 4.572441
## max neval
## 12.757229 3
## 4.799831 3
Threads
Since I am new to the arrow
library, I do not know, how many threads it uses. We can check and set the number of threads for arrow
and data.table
using built-in functions.
arrow::cpu_count()
## [1] 4
data.table::getDTthreads()
## [1] 2
This makes we wonder what happens when we set the threads and compare the reader functions again.
SetThreads <- function(n_threads = 2L) {
arrow::set_cpu_count(num_threads = n_threads)
data.table::setDTthreads(threads = n_threads)
}
mb <- vector(mode = "list", length = 4L)
for(i in 1L:4L) {
SetThreads(n_threads = i)
a_c <- c("cii")
d_c <- c("character", "integer", "integer")
mb[[i]] <- summary(microbenchmark::microbenchmark(dt_read(col_classes = d_c),
arrow_read(col_classes = a_c),
times = 3, control = list(order = "block")))
}
mb <- rbindlist(mb, idcol = "threads")
print(mb)
## threads expr min lq mean median
## 1: 1 dt_read(col_classes = d_c) 3.540304 3.692165 4.439417 3.844026
## 2: 1 arrow_read(col_classes = a_c) 4.198402 4.738828 5.937146 5.279254
## 3: 2 dt_read(col_classes = d_c) 2.824914 2.845855 4.406937 2.866795
## 4: 2 arrow_read(col_classes = a_c) 3.464902 3.890802 4.175166 4.316702
## 5: 3 dt_read(col_classes = d_c) 2.832681 2.854359 2.950912 2.876038
## 6: 3 arrow_read(col_classes = a_c) 3.479970 3.493036 4.243320 3.506102
## 7: 4 dt_read(col_classes = d_c) 2.386013 2.403941 4.573434 2.421869
## 8: 4 arrow_read(col_classes = a_c) 2.816352 2.825118 3.145613 2.833883
## uq max neval
## 1: 4.888974 5.933922 3
## 2: 6.806517 8.333781 3
## 3: 5.197948 7.529101 3
## 4: 4.530299 4.743895 3
## 5: 3.010027 3.144016 3
## 6: 4.624995 5.743888 3
## 7: 5.667145 8.912422 3
## 8: 3.310244 3.786605 3
Wrapping up
I use data.table on a daily basis for my scripts and frequently call fread
to read data into my R session. arrow
is super impressive and performance of both functions is amazing. Thanks to all package developers for enhancing our lifes :-). See you soon!