This is the first post looking at the data.table presentations from useR!2019 which was held in Toulouse from 2019-07-07 to 2019-10-10. For starters we are going to take a look at the presentations from Arun Srinivasan, one of the main developers of data.table. His talk is called THE #RDATATABLEPACKAGE for fast, flexible and memory efficient data wrangling.
Talk Agenda:
- General data.table syntax
- Explaining .SD & .SDcols
- Optimisation & new functionalities
General data.table syntax
Following code chunk installs the development version of the package, which is needed for some functions explained below.
install.packages("data.table", repos="https://Rdatatable.gitlab.io/data.table")
We start of with a simple expample.
library("data.table")
## Warning: Paket 'data.table' wurde unter R Version 3.6.1 erstellt
We define a data.table DT using the fread
function.
(DT <- fread("id code valA valB
1 c 1 10
1 b 2 11
1 c 3 12
1 c 4 13
2 a 5 14
2 a 6 15
1 b 7 16
2 a 8 17
1 c 9 18"))
## id code valA valB
## 1: 1 c 1 10
## 2: 1 b 2 11
## 3: 1 c 3 12
## 4: 1 c 4 13
## 5: 2 a 5 14
## 6: 2 a 6 15
## 7: 1 b 7 16
## 8: 2 a 8 17
## 9: 1 c 9 18
The general data.table syntax is DT[i, j, by]
, which you can translate into: i = on which rows, j = what to do (columns), grouped by what. So when we want to get the sum of valA
, we can simply do.
DT[, sum(valA)]
## [1] 45
We have not defined a i
nor by
argument here. This simply means we want to compute for all rows and grouping is not relevant. Now we group by id
.
DT[, sum(valA), by = id]
## id V1
## 1: 1 26
## 2: 2 19
We can also name the outcome of the column directly. The .()
is an alias for list
.
DT[, .(sumA = sum(valA)), by = id]
## id sumA
## 1: 1 26
## 2: 2 19
When you pass a column to the by
argument, internally the number of the unique values for that column gets evaluated. Afterwards the rows are associated to the respective group. Then the expression provided in j
, sum(valA)
gets evaluated for the groups.
This is a very simple example, but illustrates the general form quite nicely. Arun introduces another simple example by passing an expression to the i
argument.
DT[code != "b", .(sumA = sum(valA)), by = id]
## id sumA
## 1: 1 17
## 2: 2 19
Here we simply state that code
should be not equal to “b”. Very easy and logical syntax.
Explaining .SD & .SDcols
We are still operating on DT.
DT
## id code valA valB
## 1: 1 c 1 10
## 2: 1 b 2 11
## 3: 1 c 3 12
## 4: 1 c 4 13
## 5: 2 a 5 14
## 6: 2 a 6 15
## 7: 1 b 7 16
## 8: 2 a 8 17
## 9: 1 c 9 18
.SD
stands for Subset of Data. Together with .SDcols
we can use it to simply subset DT
by column names and then compute on this subset.
You could do this:
DT[, .SD, .SDcols = "valA"]
## valA
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## 6: 6
## 7: 7
## 8: 8
## 9: 9
This returns a data.table
object with one column valA
. You can of course pass more columns.
DT[, .SD, .SDcols = c("valA", "valB")]
## valA valB
## 1: 1 10
## 2: 2 11
## 3: 3 12
## 4: 4 13
## 5: 5 14
## 6: 6 15
## 7: 7 16
## 8: 8 17
## 9: 9 18
You can also use the patterns
function in .SDcols
.
DT[, .SD, .SDcols = patterns("^val")]
## valA valB
## 1: 1 10
## 2: 2 11
## 3: 3 12
## 4: 4 13
## 5: 5 14
## 6: 6 15
## 7: 7 16
## 8: 8 17
## 9: 9 18
How can we use this functionality to compute the sum of valA
and valB
grouped by id
where code != "b
?
Easy :-)
DT[code != "b", lapply(.SD, sum), .SDcols = patterns("^val"), by = "id"]
## id valA valB
## 1: 1 17 53
## 2: 2 19 46
The patterns
function only returns columns starting with val.
.SD
contains all the columns, except for the grouping columns supplied in by
, for each group.
DT[code != "b", print(.SD), .SDcols = c("code","valA", "valB"), by = id]
## code valA valB
## 1: c 1 10
## 2: c 3 12
## 3: c 4 13
## 4: c 9 18
## code valA valB
## 1: a 5 14
## 2: a 6 15
## 3: a 8 17
## Empty data.table (0 rows and 1 cols): id
Once you grouped by id
only the columns valA
and valB
are being considered due to .SDcols
. We therefore compute the sums on these columns and not on code
.
DT[code != "b", print(.SD), .SDcols = c("valA", "valB"), by = id]
## valA valB
## 1: 1 10
## 2: 3 12
## 3: 4 13
## 4: 9 18
## valA valB
## 1: 5 14
## 2: 6 15
## 3: 8 17
## Empty data.table (0 rows and 1 cols): id
So how does this lapply(.SD, sum)
work? lapply
is the infamous base function which returns a list. We operate on .SD
and compute the sum of every column in .SD
. We take a step back and look at the class
and typeof
of DT
class(DT)
## [1] "data.table" "data.frame"
typeof(DT)
## [1] "list"
typeof(DT[,.SD])
## [1] "list"
You find more information about .SD
in this stackoverflow post:
https://stackoverflow.com/questions/8508482/what-does-sd-stand-for-in-data-table-in-r
You can also check vignette("datatable-sd-usage")
in the development version.
Optimisation & new functionalities
Arun presents optimisation principles for operations in i
, j
& by
.
Optimisation in “i”
We create a data.table
which has 200 million rows and two columns x
and y
.
dt <- data.table(x = sample(1e5, 2e8, TRUE), y = runif(2e8))
Now we only return rows where the value of x
is in 1000:2000
, we operate in i
.
system.time(dt[x %in% 1000:2000])
system.time(dt[x %in% 1000:2000])
This takes a while (~17s on my laptop), but running the expression a second time results in a substantial speed up (~4s total). Please note that this not a proper benchmark. The second time we call this the expression, it reuses the index created in the first run. This currently works for ==
and %in%
and is called Auto indexing. The index is stored within the data.table
as an attribute.
Now we create a data.table
with 20 columns and 50 million rows to demonstrate that expressions in i
can run in parallel.
dt <- setDT(lapply(1:20, function(x) sample(100, 5e7, TRUE)))
We operate in i
to return the rows, where V1 > 50L
. The column is processed in parallel.
system.time(dt[V1 > 50L])
You can set the threads data.table
uses with and check the differences.
setDTthreads(threads = 1L)
system.time(dt[V1 > 50L])
setDTTthreads(threads = 2L)
Optimisation in “by”
Recently the radix order has been parallelised. We define a new data set (200 million rows, 2 cols, ~3GB).
dt <- data.table(x=sample(1e5, 2e8, TRUE), y=runif(2e8))
Now we run this expression, which returns the occurences grouped by x
.
dt[, .N, by = x]
Optimisation in “j”
In j
certain functions are optimized. Arun mentions mean
, median
, sum
, min
, max
, head
, tail
.
We can see a difference, when comparing base::mean
with just calling mean
in j
.
dt[, lapply(.SD, base::mean), by=V1]
dt[, lapply(.SD, mean), by=V1]
Internally this expression uses the function mean
implemented by data.table
in C.
New functionalities
The first function we are going to look at is nafill
. This function allows us to fill NA
values in a data.table
by a specific method, like last observation carried forward (locf). See ?data.table::nafill
for more information.
nafill
nafill
is a function that fills missing values in a data.table
. We define DT
as follows.
(DT <- fread("V1 V2 V3 V4
1 c NA 10
2 b 2 NA
NA c 3 NA
1 NA 4 NA
2 NA 5 14"))
## V1 V2 V3 V4
## 1: 1 c NA 10
## 2: 2 b 2 NA
## 3: NA c 3 NA
## 4: 1 <NA> 4 NA
## 5: 2 <NA> 5 14
Now let us try to apply nafill
to this object. The function takes in three important arguments:
nafill(x, type=c("const","locf","nocb"), fill=NA,
verbose=getOption("datatable.verbose"))
x
can be a vector, list, data.frame or data.table. The type
defines how the missing values should be filled.
locf
-> last observation carried forwardnocb
-> next observation carried backwardsconst
-> a constant
#(DT <- nafill(DT, "locf"))
Above code will throw an error indicating that we can only fill columns with type numeric
(double & integer). Here the setnafill
functions comes in handy:
#getting numeric cols
numeric_cols <- names(DT)[DT[, sapply(.SD, is.numeric)]]
setnafill(DT, "locf", cols = numeric_cols)
DT[]
## V1 V2 V3 V4
## 1: 1 c NA 10
## 2: 2 b 2 10
## 3: 2 c 3 10
## 4: 1 <NA> 4 10
## 5: 2 <NA> 5 14
This updates DT
by reference.
frollmean & frollsum
Now we are going to explore the frollmean
and frollsum
functions. These functions allow us to compute rolling means and rolling sums.
DT[, lapply(.SD, frollmean, 3), .SDcols = numeric_cols]
## V1 V3 V4
## 1: NA NA NA
## 2: NA NA NA
## 3: 1.666667 NA 10.00000
## 4: 1.666667 3 10.00000
## 5: 1.666667 4 11.33333
#see also ?frollsum
DT[, lapply(.SD, frollsum, 3), .SDcols = numeric_cols]
## V1 V3 V4
## 1: NA NA NA
## 2: NA NA NA
## 3: 5 NA 30
## 4: 5 9 30
## 5: 5 12 34
coalesce
The fcoalesce
function fills missing values in a vector by trying to pull values from one or more candidate vectors.
x <- c(11L, NA, 13L, NA, 15L, NA)
y <- c(NA, 12L, 5L, NA, NA, NA)
z <- c(11L, NA, 1L, 14L, NA, NA)
fcoalesce(x, y, z)
## [1] 11 12 13 14 15 NA
Given above syntax fcoalesce
tries to fill missing values in x
by looking at the corresponding elements of y
& z
. At the second index we find a missing value in x
, whereas y = 12L ; z = NA
. At the fourth index we find y = NA ; z = 14L
. So the new vector becomes 11 12 13 14 15 NA
.
Helpful links
See you soon!