data.table - non-equi-joins

Hi,

in this post we are going to explore non-equi-joins using the data.table package in R. non-equi-joins can be utilized, when you want to join two tables based on inequalities. We will use the flights data set, which we will download from github using the fread function. From this data set I am going to generate a second table which will be used as a look up table. More on this later. After explaining the join, I will show you a quick workaround for this special case.

prerequisites

You need data.table installed.

library(data.table)

You can then download the data using the code below.

flights <- fread("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")

The data set consists of 253316 rows and 11 columns. Our main focus lies on the columns dep_delay and origin.

flights
##         year month day dep_delay arr_delay carrier origin dest air_time
##      1: 2014     1   1        14        13      AA    JFK  LAX      359
##      2: 2014     1   1        -3        13      AA    JFK  LAX      363
##      3: 2014     1   1         2         9      AA    JFK  LAX      351
##      4: 2014     1   1        -8       -26      AA    LGA  PBI      157
##      5: 2014     1   1         2         1      AA    JFK  LAX      350
##     ---                                                                
## 253312: 2014    10  31         1       -30      UA    LGA  IAH      201
## 253313: 2014    10  31        -5       -14      UA    EWR  IAH      189
## 253314: 2014    10  31        -8        16      MQ    LGA  RDU       83
## 253315: 2014    10  31        -4        15      MQ    LGA  DTW       75
## 253316: 2014    10  31        -5         1      MQ    LGA  SDF      110
##         distance hour
##      1:     2475    9
##      2:     2475   11
##      3:     2475   19
##      4:     1035    7
##      5:     2475   13
##     ---              
## 253312:     1416   14
## 253313:     1400    8
## 253314:      431   11
## 253315:      502   11
## 253316:      659    8

Joining

We then generate a look-up table. We compute the mean and standard deviation of dep_delay grouped by the origin of the flight.

look_up <- flights[, .(mean_dep_delay = mean(dep_delay), sd_dep_delay = sd(dep_delay)), by = "origin"]
look_up
##    origin mean_dep_delay sd_dep_delay
## 1:    JFK       11.44617     41.40737
## 2:    LGA       10.60500     40.11721
## 3:    EWR       15.21248     43.15634

The generated table can now be used to filter the original table flights. We want to filter for the extremes, which are flights with a departure delay (dep_delay) of more than 3 standard deviations away from the mean.

look_up[, mean_sd_delay := mean_dep_delay + 3 * sd_dep_delay]
look_up
##    origin mean_dep_delay sd_dep_delay mean_sd_delay
## 1:    JFK       11.44617     41.40737      135.6683
## 2:    LGA       10.60500     40.11721      130.9566
## 3:    EWR       15.21248     43.15634      144.6815

Using the code below, we can perform the join. The syntax is quite simple, but does not exactly returns what one might expect. The general syntax is x[i, on = .(…)], where x is the flights table and i is the look_up table. The on statement sets the rules for joining both tables together. We already calculated a column mean_sd_delay per origin. Let us assume we are interested in analyzing flights where dep_delay is bigger than mean_sd_delay. For demonstration purposes we join look_up to flights, keeping all rows in flights, which is an equi join using origin first. Now we of course could have simply computed the values using flights only.

flights[look_up, on = .(origin)]
##         year month day dep_delay arr_delay carrier origin dest air_time
##      1: 2014     1   1        14        13      AA    JFK  LAX      359
##      2: 2014     1   1        -3        13      AA    JFK  LAX      363
##      3: 2014     1   1         2         9      AA    JFK  LAX      351
##      4: 2014     1   1         2         1      AA    JFK  LAX      350
##      5: 2014     1   1        -2       -18      AA    JFK  LAX      338
##     ---                                                                
## 253312: 2014    10  31        41        19      UA    EWR  SFO      344
## 253313: 2014    10  31       427       393      UA    EWR  ORD      100
## 253314: 2014    10  31        10       -27      UA    EWR  LAX      326
## 253315: 2014    10  31        18       -14      UA    EWR  LAS      291
## 253316: 2014    10  31        -5       -14      UA    EWR  IAH      189
##         distance hour mean_dep_delay sd_dep_delay mean_sd_delay
##      1:     2475    9       11.44617     41.40737      135.6683
##      2:     2475   11       11.44617     41.40737      135.6683
##      3:     2475   19       11.44617     41.40737      135.6683
##      4:     2475   13       11.44617     41.40737      135.6683
##      5:     2475   21       11.44617     41.40737      135.6683
##     ---                                                        
## 253312:     2565   12       15.21248     43.15634      144.6815
## 253313:      719   21       15.21248     43.15634      144.6815
## 253314:     2454   10       15.21248     43.15634      144.6815
## 253315:     2227   16       15.21248     43.15634      144.6815
## 253316:     1400    8       15.21248     43.15634      144.6815

Actually performing the non-equi-join we simply add to the on statement. Now only rows in flights where the logical vector dep_delay > mean_sd_delay equates to TRUE will be returned. Please note that the dep_delay column is populated with the values of mean_sd_delay.

flights[look_up, on = .(origin, dep_delay > mean_sd_delay)]
##       year month day dep_delay arr_delay carrier origin dest air_time distance
##    1: 2014     1   1  135.6683       133      AA    JFK  LAX      345     2475
##    2: 2014     1   1  135.6683       197      B6    JFK  MCO      154      944
##    3: 2014     1   1  135.6683       161      B6    JFK  PBI      152     1028
##    4: 2014     1   1  135.6683       221      B6    JFK  BTV       48      266
##    5: 2014     1   1  135.6683       178      B6    JFK  RDU       74      427
##   ---                                                                         
## 5566: 2014    10  31  144.6815       130      UA    EWR  ORD       98      719
## 5567: 2014    10  31  144.6815       211      UA    EWR  ORD      116      719
## 5568: 2014    10  31  144.6815       175      UA    EWR  MCO      141      937
## 5569: 2014    10  31  144.6815       289      UA    EWR  ORD      106      719
## 5570: 2014    10  31  144.6815       393      UA    EWR  ORD      100      719
##       hour mean_dep_delay sd_dep_delay
##    1:   19       11.44617     41.40737
##    2:   15       11.44617     41.40737
##    3:   23       11.44617     41.40737
##    4:   16       11.44617     41.40737
##    5:   20       11.44617     41.40737
##   ---                                 
## 5566:   21       15.21248     43.15634
## 5567:   17       15.21248     43.15634
## 5568:   16       15.21248     43.15634
## 5569:   22       15.21248     43.15634
## 5570:   21       15.21248     43.15634

You absolutely need to define the columns using x and i to get the result you want. The next code snippet uses the j argument of data.table to return the desired values. If you use j like this and want to return all columns of x and i you would need to specify every column in x and j.

flights[look_up, .(dep_delay = x.dep_delay, 
                   arr_delay = x.arr_delay, 
                   carrier = x.carrier,
                   mean_sd_delay = i.mean_sd_delay) ,
        on = .(origin, dep_delay > mean_sd_delay)]
##       dep_delay arr_delay carrier mean_sd_delay
##    1:       142       133      AA      135.6683
##    2:       162       197      B6      135.6683
##    3:       151       161      B6      135.6683
##    4:       231       221      B6      135.6683
##    5:       182       178      B6      135.6683
##   ---                                          
## 5566:       169       130      UA      144.6815
## 5567:       154       211      UA      144.6815
## 5568:       176       175      UA      144.6815
## 5569:       316       289      UA      144.6815
## 5570:       427       393      UA      144.6815

We can avoid specifying the columns of x for example by adding and index column in flights, only returning the index of x in the non-equi-join and then perform a join with the flights table. I added a order at the end to allow for a comparison with a different approach.

flights[, index := 1:.N]
(a <- flights[flights[look_up, .(index), on = .(origin, dep_delay > mean_sd_delay)], on = "index"][order(origin)])
##       year month day dep_delay arr_delay carrier origin dest air_time distance
##    1: 2014     1   1       191       185      AA    EWR  DFW      214     1372
##    2: 2014     1   1       297       290      DL    EWR  ATL      114      746
##    3: 2014     1   1       148       146      EV    EWR  AVL      101      583
##    4: 2014     1   1       210       218      EV    EWR  MEM      176      946
##    5: 2014     1   1       260       244      UA    EWR  SEA      339     2402
##   ---                                                                         
## 5566: 2014    10  31       159       218      AA    LGA  ORD      118      733
## 5567: 2014    10  31       244       220      UA    LGA  ORD       98      733
## 5568: 2014    10  31       366       365      UA    LGA  ORD      123      733
## 5569: 2014    10  31       167       138      UA    LGA  ORD      102      733
## 5570: 2014    10  31       271       236      UA    LGA  IAH      195     1416
##       hour  index
##    1:   16     19
##    2:   16    332
##    3:   16    372
##    4:   21    387
##    5:   23    584
##   ---            
## 5566:   17 252812
## 5567:   19 253220
## 5568:   18 253241
## 5569:   19 253246
## 5570:   23 253288

Now this returns the result you might be looking for.

Why the trouble?

There is, in this special case, a much simpler version, avoiding the non-equi join entirely. You simply compute a new column in j and chain a filter afterwards in i and group the table by origin. The order is set to origin to allow for a comparison with a.

(b <- flights[, mean_sd_delay := (mean(dep_delay) + 3 * sd(dep_delay)), by = "origin"][
  dep_delay > mean_sd_delay][order(origin)])
##       year month day dep_delay arr_delay carrier origin dest air_time distance
##    1: 2014     1   1       191       185      AA    EWR  DFW      214     1372
##    2: 2014     1   1       297       290      DL    EWR  ATL      114      746
##    3: 2014     1   1       148       146      EV    EWR  AVL      101      583
##    4: 2014     1   1       210       218      EV    EWR  MEM      176      946
##    5: 2014     1   1       260       244      UA    EWR  SEA      339     2402
##   ---                                                                         
## 5566: 2014    10  31       159       218      AA    LGA  ORD      118      733
## 5567: 2014    10  31       244       220      UA    LGA  ORD       98      733
## 5568: 2014    10  31       366       365      UA    LGA  ORD      123      733
## 5569: 2014    10  31       167       138      UA    LGA  ORD      102      733
## 5570: 2014    10  31       271       236      UA    LGA  IAH      195     1416
##       hour  index mean_sd_delay
##    1:   16     19      144.6815
##    2:   16    332      144.6815
##    3:   16    372      144.6815
##    4:   21    387      144.6815
##    5:   23    584      144.6815
##   ---                          
## 5566:   17 252812      130.9566
## 5567:   19 253220      130.9566
## 5568:   18 253241      130.9566
## 5569:   19 253246      130.9566
## 5570:   23 253288      130.9566
all.equal(a$dep_delay, b$dep_delay)
## [1] TRUE

Please note that the column order and the number of columns differ in a and b, because we introduced an index in a. This post hopefully serves the purpose of providing a quick introduction into non-equi-joins using data.table.

Have a nice day! Moritz

Avatar
Moritz Mueller-Navarra

A Data Scientist using R

Related