Hi,
in this post I illustrate how we can generate NA
(missing values) rows using only data.table
functions, where you have a date column (or equivalent) in your data.
We begin by loading the package.
library(data.table)
## Warning: Paket 'data.table' wurde unter R Version 3.6.3 erstellt
Problem
Let us assume you have a data.table
with three columns date
, value
and product
. Here date
is of type IDate, value
& product
are integer. You can think of it as sales data, where you observe daily sales for a certain product. Thus giving you a time series for every product. We assume that we only get a new entry, when we actually observe a sale. Let us look at a minimal example.
(dt <- data.table(date = as.IDate(c("2019-11-01","2019-11-02", "2019-11-04", "2019-11-07")), value = c(3L,4L,1L,8L), product = c(1L,2L,1L,1L)))
## date value product
## 1: 2019-11-01 3 1
## 2: 2019-11-02 4 2
## 3: 2019-11-04 1 1
## 4: 2019-11-07 8 1
sapply(dt, typeof)
## date value product
## "integer" "integer" "integer"
For the date
column you get integer
, because the IDate
class is a integer based date class. Please check ?IDate
for more information.
Since we want a row for every possible date within a certain range, we need to insert rows, where we did not observe anything.
Solution
It is obvious that we have not sold anything on certain days. Now, our job is to insert rows where we write NA
for value
by date
and product
. There are different approaches for this task, we focus on joining. Since we know the start and the end of the time series, we can generate a sequence.
(seq_dates <- seq(as.IDate("2019-11-01"), as.IDate("2019-11-07"), by = "days"))
## [1] "2019-11-01" "2019-11-02" "2019-11-03" "2019-11-04" "2019-11-05"
## [6] "2019-11-06" "2019-11-07"
Now we can use the CJ
function from data.table
to generate a data.table
. We join dt
to the return value of CJ
based on the columns date
and product
dt[CJ(product = unique(product), date = seq_dates), on = c("date", "product")]
## date value product
## 1: 2019-11-01 3 1
## 2: 2019-11-02 NA 1
## 3: 2019-11-03 NA 1
## 4: 2019-11-04 1 1
## 5: 2019-11-05 NA 1
## 6: 2019-11-06 NA 1
## 7: 2019-11-07 8 1
## 8: 2019-11-01 NA 2
## 9: 2019-11-02 4 2
## 10: 2019-11-03 NA 2
## 11: 2019-11-04 NA 2
## 12: 2019-11-05 NA 2
## 13: 2019-11-06 NA 2
## 14: 2019-11-07 NA 2
The CJ
function simply generates a data.table
with all possible combinations of date
and product
.
CJ(date = seq_dates, product = unique(dt$product))
## date product
## 1: 2019-11-01 1
## 2: 2019-11-01 2
## 3: 2019-11-02 1
## 4: 2019-11-02 2
## 5: 2019-11-03 1
## 6: 2019-11-03 2
## 7: 2019-11-04 1
## 8: 2019-11-04 2
## 9: 2019-11-05 1
## 10: 2019-11-05 2
## 11: 2019-11-06 1
## 12: 2019-11-06 2
## 13: 2019-11-07 1
## 14: 2019-11-07 2
Scalability
We check the perfomance of the proposed workflow by generating a data.table
with two parameters: n_days
& n_products
. The code below generates random n_days
within the year 2019 by product
. The value
is simply a random integer. This results in an object with 2.5 million rows and three columns.
seq_dates <- seq(as.IDate("2019-01-01"), as.IDate("2019-12-31"), by = "days")
n_days <- 250L
n_products <- 10000L
dt_large <-data.table(rbindlist(lapply(1:n_products, function(x) {
date <- sample(seq_dates, n_days)
product <- rep.int(x, n_days)
return(list(date = date, product = product))
})), value = sample.int(10, n_products*n_days, replace = TRUE))
sapply(dt_large, typeof)
## date product value
## "integer" "integer" "integer"
Using CJ
we get dt_large_na
with 3.65 million rows (365 days * n_products
) and missing values for value
, when did not sold a product on a certain day.
system.time(dt_large_na <- dt_large[CJ(product = unique(product), date = seq_dates), on = c("date", "product")])
## User System verstrichen
## 0.73 0.09 0.67
dt_large_na
## date product value
## 1: 2019-01-01 1 NA
## 2: 2019-01-02 1 3
## 3: 2019-01-03 1 NA
## 4: 2019-01-04 1 NA
## 5: 2019-01-05 1 1
## ---
## 3649996: 2019-12-27 10000 NA
## 3649997: 2019-12-28 10000 10
## 3649998: 2019-12-29 10000 6
## 3649999: 2019-12-30 10000 8
## 3650000: 2019-12-31 10000 10
For product 1 & 5 the data looks like this:
dt_large_na[product == 1]
## date product value
## 1: 2019-01-01 1 NA
## 2: 2019-01-02 1 3
## 3: 2019-01-03 1 NA
## 4: 2019-01-04 1 NA
## 5: 2019-01-05 1 1
## ---
## 361: 2019-12-27 1 5
## 362: 2019-12-28 1 9
## 363: 2019-12-29 1 10
## 364: 2019-12-30 1 7
## 365: 2019-12-31 1 8
dt_large_na[product == 5]
## date product value
## 1: 2019-01-01 5 5
## 2: 2019-01-02 5 9
## 3: 2019-01-03 5 NA
## 4: 2019-01-04 5 NA
## 5: 2019-01-05 5 4
## ---
## 361: 2019-12-27 5 8
## 362: 2019-12-28 5 5
## 363: 2019-12-29 5 1
## 364: 2019-12-30 5 3
## 365: 2019-12-31 5 NA
Filling missing values
We generated a data.table
with missing values by joining. There are a lot of techniques to fill NA
values. I want to quickly demonstrate how you can use the nafill
function included in data.table
. I introduce a new column in dt_large_na
to demonstrate the functionality.
dt_large_na[, filled_value := nafill(value, "locf"), by = "product"]
This simply fills the NA
values on a product level by last observation carried forward (locf). You might want to use a more sophisticated approach for real data. Please note that nafill
only accepts numeric vectors.
dt_large_na
## date product value filled_value
## 1: 2019-01-01 1 NA NA
## 2: 2019-01-02 1 3 3
## 3: 2019-01-03 1 NA 3
## 4: 2019-01-04 1 NA 3
## 5: 2019-01-05 1 1 1
## ---
## 3649996: 2019-12-27 10000 NA 8
## 3649997: 2019-12-28 10000 10 10
## 3649998: 2019-12-29 10000 6 6
## 3649999: 2019-12-30 10000 8 8
## 3650000: 2019-12-31 10000 10 10
Imputing missing values requires a lot more thought for real applications. A great resource is: https://rmisstastic.netlify.com/how-to/impute/missimp.
See you soon,
Moritz