data.table - Generate and fill missing values.

Last updated on Nov 4, 2019 5 min read

Hi,

in this post I illustrate how we can generate NA (missing values) rows using only data.table functions, where you have a date column (or equivalent) in your data.
We begin by loading the package.

library(data.table)

## Warning: Paket 'data.table' wurde unter R Version 3.6.3 erstellt

Problem

Let us assume you have a data.table with three columns date, value and product. Here date is of type IDate, value & product are integer. You can think of it as sales data, where you observe daily sales for a certain product. Thus giving you a time series for every product. We assume that we only get a new entry, when we actually observe a sale. Let us look at a minimal example.

(dt <- data.table(date = as.IDate(c("2019-11-01","2019-11-02", "2019-11-04", "2019-11-07")), value = c(3L,4L,1L,8L), product = c(1L,2L,1L,1L)))

##          date value product
## 1: 2019-11-01     3       1
## 2: 2019-11-02     4       2
## 3: 2019-11-04     1       1
## 4: 2019-11-07     8       1

sapply(dt, typeof)

##      date     value   product 
## "integer" "integer" "integer"

For the date column you get integer, because the IDate class is a integer based date class. Please check ?IDate for more information.

Since we want a row for every possible date within a certain range, we need to insert rows, where we did not observe anything.

Solution

It is obvious that we have not sold anything on certain days. Now, our job is to insert rows where we write NA for value by date and product. There are different approaches for this task, we focus on joining. Since we know the start and the end of the time series, we can generate a sequence.

(seq_dates <- seq(as.IDate("2019-11-01"), as.IDate("2019-11-07"), by = "days"))

## [1] "2019-11-01" "2019-11-02" "2019-11-03" "2019-11-04" "2019-11-05"
## [6] "2019-11-06" "2019-11-07"

Now we can use the CJ function from data.table to generate a data.table. We join dt to the return value of CJ based on the columns date and product

dt[CJ(product = unique(product), date = seq_dates), on = c("date", "product")]

##           date value product
##  1: 2019-11-01     3       1
##  2: 2019-11-02    NA       1
##  3: 2019-11-03    NA       1
##  4: 2019-11-04     1       1
##  5: 2019-11-05    NA       1
##  6: 2019-11-06    NA       1
##  7: 2019-11-07     8       1
##  8: 2019-11-01    NA       2
##  9: 2019-11-02     4       2
## 10: 2019-11-03    NA       2
## 11: 2019-11-04    NA       2
## 12: 2019-11-05    NA       2
## 13: 2019-11-06    NA       2
## 14: 2019-11-07    NA       2

The CJ function simply generates a data.table with all possible combinations of date and product.

CJ(date = seq_dates,  product = unique(dt$product))

##           date product
##  1: 2019-11-01       1
##  2: 2019-11-01       2
##  3: 2019-11-02       1
##  4: 2019-11-02       2
##  5: 2019-11-03       1
##  6: 2019-11-03       2
##  7: 2019-11-04       1
##  8: 2019-11-04       2
##  9: 2019-11-05       1
## 10: 2019-11-05       2
## 11: 2019-11-06       1
## 12: 2019-11-06       2
## 13: 2019-11-07       1
## 14: 2019-11-07       2

Scalability

We check the perfomance of the proposed workflow by generating a data.table with two parameters: n_days & n_products. The code below generates random n_days within the year 2019 by product. The value is simply a random integer. This results in an object with 2.5 million rows and three columns.

seq_dates  <- seq(as.IDate("2019-01-01"), as.IDate("2019-12-31"), by = "days")
n_days     <- 250L
n_products <- 10000L

dt_large <-data.table(rbindlist(lapply(1:n_products, function(x) {
  date    <- sample(seq_dates, n_days)
  product <- rep.int(x, n_days)
  return(list(date = date, product = product))
})), value = sample.int(10, n_products*n_days, replace = TRUE))

sapply(dt_large, typeof)

##      date   product     value 
## "integer" "integer" "integer"

Using CJ we get dt_large_na with 3.65 million rows (365 days * n_products) and missing values for value, when did not sold a product on a certain day.

system.time(dt_large_na <- dt_large[CJ(product = unique(product), date = seq_dates), on = c("date", "product")])

##        User      System verstrichen 
##        0.73        0.09        0.67

dt_large_na

##                date product value
##       1: 2019-01-01       1    NA
##       2: 2019-01-02       1     3
##       3: 2019-01-03       1    NA
##       4: 2019-01-04       1    NA
##       5: 2019-01-05       1     1
##      ---                         
## 3649996: 2019-12-27   10000    NA
## 3649997: 2019-12-28   10000    10
## 3649998: 2019-12-29   10000     6
## 3649999: 2019-12-30   10000     8
## 3650000: 2019-12-31   10000    10

For product 1 & 5 the data looks like this:

dt_large_na[product == 1]

##            date product value
##   1: 2019-01-01       1    NA
##   2: 2019-01-02       1     3
##   3: 2019-01-03       1    NA
##   4: 2019-01-04       1    NA
##   5: 2019-01-05       1     1
##  ---                         
## 361: 2019-12-27       1     5
## 362: 2019-12-28       1     9
## 363: 2019-12-29       1    10
## 364: 2019-12-30       1     7
## 365: 2019-12-31       1     8

dt_large_na[product == 5]

##            date product value
##   1: 2019-01-01       5     5
##   2: 2019-01-02       5     9
##   3: 2019-01-03       5    NA
##   4: 2019-01-04       5    NA
##   5: 2019-01-05       5     4
##  ---                         
## 361: 2019-12-27       5     8
## 362: 2019-12-28       5     5
## 363: 2019-12-29       5     1
## 364: 2019-12-30       5     3
## 365: 2019-12-31       5    NA

Filling missing values

We generated a data.table with missing values by joining. There are a lot of techniques to fill NA values. I want to quickly demonstrate how you can use the nafill function included in data.table. I introduce a new column in dt_large_na to demonstrate the functionality.

dt_large_na[, filled_value := nafill(value, "locf"), by = "product"]

This simply fills the NA values on a product level by last observation carried forward (locf). You might want to use a more sophisticated approach for real data. Please note that nafill only accepts numeric vectors.

dt_large_na

##                date product value filled_value
##       1: 2019-01-01       1    NA           NA
##       2: 2019-01-02       1     3            3
##       3: 2019-01-03       1    NA            3
##       4: 2019-01-04       1    NA            3
##       5: 2019-01-05       1     1            1
##      ---                                      
## 3649996: 2019-12-27   10000    NA            8
## 3649997: 2019-12-28   10000    10           10
## 3649998: 2019-12-29   10000     6            6
## 3649999: 2019-12-30   10000     8            8
## 3650000: 2019-12-31   10000    10           10

Imputing missing values requires a lot more thought for real applications. A great resource is: https://rmisstastic.netlify.com/how-to/impute/missimp.

See you soon,

Moritz

Moritz Mueller-Navarra

A Data Scientist using R