data.table - useR!2019 I

This is the first post looking at the data.table presentations from useR!2019, which was held in Toulouse in July 2019. For starters we are going to take a look at the presentation by Arun Srinivasan, one of the main developers of data.table. His talk is called THE #RDATATABLEPACKAGE for fast, flexible and memory efficient data wrangling.

Talk Agenda:

  • General data.table syntax
  • Explaining .SD & .SDcols
  • Optimisation & new functionalities

General data.table syntax

The following code chunk installs the development version of the package, which is needed for some of the functions explained below.

install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")

We start off with a simple example.

library("data.table")
## Warning: package 'data.table' was built under R version 3.6.1

We define a data.table DT using the fread function.

(DT <- fread("id code valA valB
1 c 1 10
1 b 2 11
1 c 3 12
1 c 4 13
2 a 5 14
2 a 6 15
1 b 7 16
2 a 8 17
1 c 9 18"))
##    id code valA valB
## 1:  1    c    1   10
## 2:  1    b    2   11
## 3:  1    c    3   12
## 4:  1    c    4   13
## 5:  2    a    5   14
## 6:  2    a    6   15
## 7:  1    b    7   16
## 8:  2    a    8   17
## 9:  1    c    9   18

The general data.table syntax is DT[i, j, by], which you can translate into: i = on which rows to operate, j = what to compute (which columns), by = what to group by. So when we want the sum of valA, we can simply write:

DT[, sum(valA)]
## [1] 45

We have defined neither an i nor a by argument here. This simply means we compute over all rows and no grouping is applied. Now we group by id.

DT[, sum(valA), by = id]
##    id V1
## 1:  1 26
## 2:  2 19

We can also name the resulting column directly. .() is an alias for list().

DT[, .(sumA = sum(valA)), by = id]
##    id sumA
## 1:  1   26
## 2:  2   19

When you pass a column to the by argument, data.table internally determines the unique values of that column, assigns each row to its respective group, and then evaluates the expression provided in j, sum(valA), once per group.
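For comparison, the same grouped sum can be written in base R with tapply; a quick illustrative sketch (not from the talk), using a re-created DT:

```r
library("data.table")

# Re-create the example table from above
DT <- data.table(id   = c(1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L),
                 valA = 1:9)

# base-R equivalent of DT[, sum(valA), by = id]
tapply(DT$valA, DT$id, sum)
## id 1: 26, id 2: 19
```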

This is a very simple example, but illustrates the general form quite nicely. Arun introduces another simple example by passing an expression to the i argument.

DT[code != "b", .(sumA = sum(valA)), by = id]
##    id sumA
## 1:  1   17
## 2:  2   19

Here we simply state that code should not be equal to “b”. A very easy and logical syntax.

Explaining .SD & .SDcols

We are still operating on DT.

DT
##    id code valA valB
## 1:  1    c    1   10
## 2:  1    b    2   11
## 3:  1    c    3   12
## 4:  1    c    4   13
## 5:  2    a    5   14
## 6:  2    a    6   15
## 7:  1    b    7   16
## 8:  2    a    8   17
## 9:  1    c    9   18

.SD stands for Subset of Data. Together with .SDcols we can use it to simply subset DT by column names and then compute on this subset.

You could do this:

DT[, .SD, .SDcols = "valA"]
##    valA
## 1:    1
## 2:    2
## 3:    3
## 4:    4
## 5:    5
## 6:    6
## 7:    7
## 8:    8
## 9:    9

This returns a data.table object with one column valA. You can of course pass more columns.

DT[, .SD, .SDcols = c("valA", "valB")]
##    valA valB
## 1:    1   10
## 2:    2   11
## 3:    3   12
## 4:    4   13
## 5:    5   14
## 6:    6   15
## 7:    7   16
## 8:    8   17
## 9:    9   18

You can also use the patterns function in .SDcols.

DT[, .SD, .SDcols = patterns("^val")]
##    valA valB
## 1:    1   10
## 2:    2   11
## 3:    3   12
## 4:    4   13
## 5:    5   14
## 6:    6   15
## 7:    7   16
## 8:    8   17
## 9:    9   18

How can we use this functionality to compute the sum of valA and valB grouped by id where code != "b"? Easy :-)

DT[code != "b", lapply(.SD, sum), .SDcols = patterns("^val"), by = "id"]
##    id valA valB
## 1:  1   17   53
## 2:  2   19   46

The patterns function only returns columns starting with val. .SD contains all the columns, except for the grouping columns supplied in by, for each group.
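When no pattern fits, the column set can also be built explicitly with base R and passed to .SDcols; a small sketch (not from the talk) on a re-created subset of DT:

```r
library("data.table")

DT <- data.table(id   = c(1L, 1L, 2L),
                 code = c("c", "b", "a"),
                 valA = 1:3, valB = 4:6)

# Keep every column except the ones we name explicitly
cols <- setdiff(names(DT), c("id", "code"))
DT[, lapply(.SD, sum), .SDcols = cols, by = id]
```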

DT[code != "b", print(.SD), .SDcols = c("code","valA", "valB"), by = id]
##    code valA valB
## 1:    c    1   10
## 2:    c    3   12
## 3:    c    4   13
## 4:    c    9   18
##    code valA valB
## 1:    a    5   14
## 2:    a    6   15
## 3:    a    8   17
## Empty data.table (0 rows and 1 cols): id

Once grouped by id, only the columns valA and valB are considered, due to .SDcols. We therefore compute the sums on these columns and not on code.

DT[code != "b", print(.SD), .SDcols = c("valA", "valB"), by = id]
##    valA valB
## 1:    1   10
## 2:    3   12
## 3:    4   13
## 4:    9   18
##    valA valB
## 1:    5   14
## 2:    6   15
## 3:    8   17
## Empty data.table (0 rows and 1 cols): id

So how does this lapply(.SD, sum) work? lapply is the familiar base function that applies a function over a list and returns a list. Since .SD is itself a list of columns, we compute the sum of every column in .SD. Let us take a step back and look at the class and type of DT.

class(DT)
## [1] "data.table" "data.frame"
typeof(DT)
## [1] "list"
typeof(DT[,.SD])
## [1] "list"

You find more information about .SD in this stackoverflow post:

https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r

You can also check vignette("datatable-sd-usage") in the development version.

Optimisation & new functionalities

Arun presents optimisation principles for operations in i, j & by.

Optimisation in “i”

We create a data.table which has 200 million rows and two columns x and y.

dt <- data.table(x = sample(1e5, 2e8, TRUE), y = runif(2e8)) 

Now we return only the rows where the value of x is in 1000:2000; we operate in i.

system.time(dt[x %in% 1000:2000])
system.time(dt[x %in% 1000:2000])

This takes a while (~17 s on my laptop), but running the expression a second time results in a substantial speed-up (~4 s total). Please note that this is not a proper benchmark. The second time we call the expression, it reuses the index created during the first run. This currently works for == and %in% and is called auto indexing. The index is stored within the data.table as an attribute.
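You can inspect the automatically created index with the indices() helper; a small sketch on a toy table (assuming auto indexing is enabled, which is the default):

```r
library("data.table")

dt_small <- data.table(x = sample(100L, 1e4L, TRUE), y = runif(1e4L))
indices(dt_small)                    # NULL - no index yet

invisible(dt_small[x %in% 10L:20L])  # first subset auto-creates an index on x
indices(dt_small)                    # "x"
```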

Now we create a data.table with 20 columns and 50 million rows to demonstrate that expressions in i can run in parallel.

dt <- setDT(lapply(1:20, function(x) sample(100, 5e7, TRUE))) 

We operate in i to return the rows where V1 > 50L. The column is processed in parallel.

system.time(dt[V1 > 50L])

You can set the number of threads data.table uses with setDTthreads() and compare the timings.

setDTthreads(threads = 1L)
system.time(dt[V1 > 50L])
setDTthreads(threads = 2L)
system.time(dt[V1 > 50L])

Optimisation in “by”

Recently the radix ordering used for grouping has been parallelised. We define a new data set (200 million rows, 2 columns, ~3 GB).

dt <- data.table(x=sample(1e5, 2e8, TRUE), y=runif(2e8))

Now we run this expression, which counts the occurrences grouped by x.

dt[, .N, by = x]
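A related detail (not from the talk): by returns groups in order of first appearance, while keyby additionally sorts the result by the grouping column. A quick sketch:

```r
library("data.table")

dt_demo <- data.table(x = c(3L, 1L, 2L, 1L, 3L, 3L))
dt_demo[, .N, by = x]     # groups in order of first appearance: 3, 1, 2
dt_demo[, .N, keyby = x]  # result sorted by x: 1, 2, 3
```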

Optimisation in “j”

In j certain functions are optimised. Arun mentions mean, median, sum, min, max, head and tail. We can see the difference when comparing base::mean with simply calling mean in j.

dt[, lapply(.SD, base::mean), by=V1]
dt[, lapply(.SD, mean), by=V1]

Internally the second expression uses data.table's own implementation of mean written in C (the so-called GForce optimisation).
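You can watch this optimisation being applied by switching on data.table's verbose mode; a sketch on a small table:

```r
library("data.table")

dt_verbose <- data.table(V1 = rep(1:2, each = 3), V2 = as.numeric(1:6))

options(datatable.verbose = TRUE)
dt_verbose[, lapply(.SD, mean), by = V1]  # the verbose log reports the GForce-optimised j
options(datatable.verbose = FALSE)
```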

New functionalities

The first function we are going to look at is nafill. It allows us to fill NA values in a data.table by a specific method, such as last observation carried forward (locf). See ?data.table::nafill for more information.

nafill

We define DT as follows.

(DT <- fread("V1 V2 V3 V4
1 c NA 10
2 b 2 NA
NA c 3 NA
1 NA 4 NA
2 NA 5 14"))
##    V1   V2 V3 V4
## 1:  1    c NA 10
## 2:  2    b  2 NA
## 3: NA    c  3 NA
## 4:  1 <NA>  4 NA
## 5:  2 <NA>  5 14

Now let us try to apply nafill to this object. The function takes three important arguments:

nafill(x, type=c("const","locf","nocb"), fill=NA,
       verbose=getOption("datatable.verbose"))

x can be a vector, list, data.frame or data.table. type defines how the missing values should be filled:

  • locf -> last observation carried forward
  • nocb -> next observation carried backwards
  • const -> a constant
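On a plain numeric vector the three types behave as follows (a quick sketch):

```r
library("data.table")

v <- c(1, NA, NA, 4)
nafill(v, type = "locf")             # 1 1 1 4
nafill(v, type = "nocb")             # 1 4 4 4
nafill(v, type = "const", fill = 0)  # 1 0 0 4
```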

#(DT <- nafill(DT, "locf"))

The above code throws an error indicating that nafill can only fill columns of type numeric (double & integer). Here the setnafill function comes in handy:

#getting numeric cols

numeric_cols <- names(DT)[DT[, sapply(.SD, is.numeric)]]

setnafill(DT, "locf", cols = numeric_cols)
DT[]
##    V1   V2 V3 V4
## 1:  1    c NA 10
## 2:  2    b  2 10
## 3:  2    c  3 10
## 4:  1 <NA>  4 10
## 5:  2 <NA>  5 14

This updates DT by reference.
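"By reference" means no copy of DT is made; the columns are modified in place. A quick check using address(), which data.table exports:

```r
library("data.table")

DT2 <- data.table(a = c(1, NA, 3))
before <- address(DT2)

setnafill(DT2, "locf")

identical(address(DT2), before)  # TRUE - same object, modified in place
DT2$a                            # 1 1 3
```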

frollmean & frollsum

Now we are going to explore the frollmean and frollsum functions. These functions allow us to compute rolling means and rolling sums.

DT[, lapply(.SD, frollmean, 3), .SDcols = numeric_cols]
##          V1 V3       V4
## 1:       NA NA       NA
## 2:       NA NA       NA
## 3: 1.666667 NA 10.00000
## 4: 1.666667  3 10.00000
## 5: 1.666667  4 11.33333
#see also ?frollsum
DT[, lapply(.SD, frollsum, 3), .SDcols = numeric_cols]
##    V1 V3 V4
## 1: NA NA NA
## 2: NA NA NA
## 3:  5 NA 30
## 4:  5  9 30
## 5:  5 12 34
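By default the window is right-aligned, i.e. the value in row i summarises rows i-2 to i for n = 3; align = "center" and align = "left" are also available. A sketch:

```r
library("data.table")

frollmean(1:5, 3)                    # NA NA  2  3  4  (right-aligned, the default)
frollmean(1:5, 3, align = "center")  # NA  2  3  4 NA
frollmean(1:5, 3, align = "left")    #  2  3  4 NA NA
```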

fcoalesce

The fcoalesce function fills missing values in a vector by trying to pull values from one or more candidate vectors.

x <- c(11L, NA, 13L, NA, 15L, NA)
y <- c(NA, 12L, 5L, NA, NA, NA)
z <- c(11L, NA, 1L, 14L, NA, NA)

fcoalesce(x, y, z)
## [1] 11 12 13 14 15 NA

Given the above call, fcoalesce tries to fill missing values in x by looking at the corresponding elements of y and z. At the second index x is missing, while y = 12L and z = NA, so 12 is used. At the fourth index y = NA and z = 14L, so 14 is used. The resulting vector is 11 12 13 14 15 NA.
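fcoalesce also recycles length-1 arguments, which gives a concise way to replace all remaining NAs with a default value; a quick sketch:

```r
library("data.table")

x <- c(11L, NA, 13L, NA, 15L, NA)
fcoalesce(x, 0L)  # 11  0 13  0 15  0
```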

Moritz Mueller-Navarra

A Data Scientist using R