Joining various data sets can be done with many packages, but to the best of my knowledge this could be done with package data.table . The syntax is clear and the computational speed is just amazing.

## The former merge() and Reduce() solutions with base R:

```
df1 <- data.frame(Customer_ID = c(1, 2, 3), Name = c("John", "Sue", "Ann"))
df2 <- data.frame(Customer_ID = c(1, 3), Year_First_Order = c(2011, 2017))
df3 <- data.frame(Customer_ID = c(1, 2, 3),
Date_Last_Order = c("2017-03-03", "2014-03-01", "2017-05-30"),
No_Items_Last_Order = c(3, 1, 1),
Total_Amount_Last_Order = c(49, 25,25))
df4 <- data.frame(Customer_ID = c(2, 3), Interested_In_Promo = c(TRUE, FALSE))
```

We could first join the first two data frames, then join the result with the third, and then join the results of that with the fourth table. In the following, a LEFT OUTER join is done using base `R`

:

```
df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge
```

```
## Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1 1 John 2011 2017-03-03 3
## 2 2 Sue NA 2014-03-01 1
## 3 3 Ann 2017 2017-05-30 1
## Total_Amount_Last_Order Interested_In_Promo
## 1 49 NA
## 2 25 TRUE
## 3 25 FALSE
```

When all the data frames are stored in one list, `Reduce()`

can be applied in the following manner:

```
df_list <- list(df1, df2, df3, df4)
Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE, all.y = FALSE), df_list)
```

```
## Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1 1 John 2011 2017-03-03 3
## 2 2 Sue NA 2014-03-01 1
## 3 3 Ann 2017 2017-05-30 1
## Total_Amount_Last_Order Interested_In_Promo
## 1 49 NA
## 2 25 TRUE
## 3 25 FALSE
```

But is this **computationally efficient**?

We need a larger data set to get some insights.

```
n <- 300000
df1 <- data.frame(Customer_ID = 1:n, Name = rep(c("John", "Sue", "Ann"), n/3))
df2 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Year_First_Order = sample(2000:2017, size = n*2/3, replace = TRUE))
df3 <- data.frame(Customer_ID = 1:n,
Date_Last_Order = rep(c("2017-03-03", "2014-03-01", "2017-05-30"),n/3),
No_Items_Last_Order = sample(1:10, size = n, replace = TRUE),
Total_Amount_Last_Order = sample(20:50, size = n, replace = TRUE))
df4 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Interested_In_Promo = sample(c(TRUE, FALSE), size = n*2/3, replace = TRUE))
```

We use the microbenchmark package to assess the computational speed. First we put our code in functions as input for `microbenchmark`

.

```
m1 <- function(df1, df2, df3, df4){
df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge
}
m2 <- function(df1, df2, df3, df4){
df_list <- list(df1, df2, df3, df4)
Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE,
all.y = FALSE), df_list)
}
```

```
library("microbenchmark")
mbm <- microbenchmark(
mergeBase = m1(df1, df2, df3, df4),
Reduce = m2(df1, df2, df3, df4),
times = 10
)
mbm
```

```
## Unit: seconds
## expr min lq mean median uq max neval cld
## mergeBase 1.028722 1.120588 1.126353 1.131829 1.152066 1.170740 10 a
## Reduce 1.105373 1.144046 1.168305 1.158710 1.179812 1.306395 10 a
```

About 1.15 seconds for one merge. Not bad, but if we would increase the size of our data set slightly, the computation times would increase dramatically.

## data.table’s solution to the above problem

It is relevant to ask, if **more efficient code** can be written that provides the same results. In addition, we ask the question if we can use **better readable syntax** than above?

Let’s replace our `merge` and `Reduce` solutions with data.table code.

We first call the package and store the data frames as data tables. In addition, we set a key for an efficient use of the binary search of the data.table package.

```
library("data.table")
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt3 <- data.table(df3)
dt4 <- data.table(df4)
setkey(dt1, Customer_ID)
setkey(dt2, Customer_ID)
setkey(dt3, Customer_ID)
setkey(dt4, Customer_ID)
```

The following `data.table`

code gives us exactly the same solution as with our previous base `R`

code.

But note, the syntax is much shorter and more user-friendly! This is all you need to write for the LEFT INNER JOIN for the given problem:

`DT_cmb <- dt4[dt3][dt2][dt1]`

Lets compare the speed of all three approaches (pure merge, Reduce and data.table).

```
m3 <- function(dt1, dt2, dt3, dt4){
setkey(dt1, Customer_ID)
setkey(dt2, Customer_ID)
setkey(dt3, Customer_ID)
setkey(dt4, Customer_ID)
DT_cmb <- dt4[dt3][dt2][dt1]
DT_cmb
}
library("microbenchmark")
mbm <- microbenchmark(
mergeBase = m1(df1, df2, df3, df4),
Reduce = m2(df1, df2, df3, df4),
dt = m3(dt1, dt2, dt3, dt4),
times = 10
)
mbm
```

```
## Unit: milliseconds
## expr min lq mean median uq max
## mergeBase 1010.03602 1030.72226 1099.73585 1047.30145 1208.32011 1286.489
## Reduce 968.37101 1039.04347 1097.75323 1048.05562 1201.69135 1284.099
## dt 43.24658 50.42165 66.52315 53.34681 70.50837 140.710
## neval cld
## 10 b
## 10 b
## 10 a
```

The solution with `data.table`

is more than 15 times faster! and the code is much shorter and elegant. Note that the differences becomes even (much) larger if the data size increases.

# Example: Sums of matrix powers

## The lapply/Reduce code:

We can also combine lapply() and Reduce(), such as been carried out in the following example.

```
library("expm")
n <- 2
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
p1 <- function(vec, P){
P_powers <- lapply(vec, function(k) P %^% k)
Reduce("+", P_powers)
}
p1(0:2, P = P)
```

```
## [,1] [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824
```

What is the aim of this code? It not straightforward to see this. First different powers of matrix **P** are applied, then the cumulative sum of resulting matrices is calculated. We can ask if we can do it with a more intuitive and more user-friendly code (just by using a simple ` for `loop), to gain computational speed and to not store the whole list of matrix powers as above.

*General note:* from `R`

version 3.0.0 `for` loops are approx. equally fast than using the `apply`

family (at least for usual problems). Multiple use of the `apply`

family results in hard to read code, especially when several apply statements are given in one line of code, and debugging of code is more difficult.

This is also to some extend the case with the solution above, where `lapply()` and `Reduce()` are used.

## The simple for() loop to the problem:

The following loop just updates a matrix by adding the previous state of the matrix with the powering of the matrix.

```
p2 <- function(vec, P){
P2 <- P %^% vec[1]
for(i in vec[-1]){
P2 <- P2 + P %^% i
}
P2
}
p2(0:2, P = P)
```

```
## [,1] [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824
```

This truly gives the same result. Let’s increase the dimension of **P** and the number of powers applied.

```
n <- 10
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
mbm <- microbenchmark(
Reduce = p1(seq_len(10000), P = P),
Simple = p2(seq_len(10000), P = P),
times = 10
)
mbm
```

```
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## Reduce 122.9000 124.2832 131.1854 133.8611 135.4301 136.7512 10 b
## Simple 109.7845 111.0879 113.7115 112.0759 113.4696 124.5798 10 a
```

The speed is comparable but slightly faster for the simple `for` loop, naturally the code is (much) better readable and debugging of code would be easier when using a `for` loop.

Note that additional gain in computational speed can be obtained using C++ within R. This can be done, e.g. using package Rcpp . However, in this case this is not straightforward to apply and either the environment of package `expm`

must be imported within a `Rcpp`

function or a powering of a matrix method must be first implemented in C++.

# Further reading

Package data.table is incredibly efficient in terms of computational speed. Use it whenever dealing with large (rectangular) data sets. Read its vignettes, tutorials and package manual. Note that – not used here – also package dplyr is fast and very user-friendly.