By Matthias Templ (ZHAW), Thoralf Mildenberger (ZHAW)
By way of example, the functionality of Reduce() is shown in https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/ . It is great to learn how to use this function on interesting problems. If you are ready (that is, if you have read the first blog post on Reduce()), we want to push you further towards writing efficient code.
Reduce() and other base R functionality can be extremely slow compared to the data manipulation functions of the packages dplyr and data.table. In addition, compared to other software and languages, solving recursive problems with R's Reduce() is very slow (see e.g. the recursive problems at Rosetta Code), and this is sometimes used to argue that R itself is slow, which is simply not true.
This blog post discusses some drawbacks of Reduce() and base R related to user-friendliness of the code and computational speed. We revisit two examples from https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/ .
Example: Merging multiple data sets
Joining several data sets can be done with many packages, but to the best of my knowledge it is best done with package data.table. The syntax is clear and the computational speed is just amazing.
The merge() and Reduce() solutions with base R from the previous post:
df1 <- data.frame(Customer_ID = c(1, 2, 3), Name = c("John", "Sue", "Ann"))
df2 <- data.frame(Customer_ID = c(1, 3), Year_First_Order = c(2011, 2017))
df3 <- data.frame(Customer_ID = c(1, 2, 3),
Date_Last_Order = c("2017-03-03", "2014-03-01", "2017-05-30"),
No_Items_Last_Order = c(3, 1, 1),
Total_Amount_Last_Order = c(49, 25,25))
df4 <- data.frame(Customer_ID = c(2, 3), Interested_In_Promo = c(TRUE, FALSE))
We could first join the first two data frames, then join the result with the third, and then join the result of that with the fourth table. In the following, a LEFT OUTER join is done using base R:
df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge
## Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1 1 John 2011 2017-03-03 3
## 2 2 Sue NA 2014-03-01 1
## 3 3 Ann 2017 2017-05-30 1
## Total_Amount_Last_Order Interested_In_Promo
## 1 49 NA
## 2 25 TRUE
## 3 25 FALSE
When all the data frames are stored in one list, Reduce() can be applied in the following manner:
df_list <- list(df1, df2, df3, df4)
Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE, all.y = FALSE), df_list)
## Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1 1 John 2011 2017-03-03 3
## 2 2 Sue NA 2014-03-01 1
## 3 3 Ann 2017 2017-05-30 1
## Total_Amount_Last_Order Interested_In_Promo
## 1 49 NA
## 2 25 TRUE
## 3 25 FALSE
But is this computationally efficient?
We need a larger data set to get some insights.
n <- 300000
df1 <- data.frame(Customer_ID = 1:n, Name = rep(c("John", "Sue", "Ann"), n/3))
df2 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Year_First_Order = sample(2000:2017, size = n*2/3, replace = TRUE))
df3 <- data.frame(Customer_ID = 1:n,
Date_Last_Order = rep(c("2017-03-03", "2014-03-01", "2017-05-30"),n/3),
No_Items_Last_Order = sample(1:10, size = n, replace = TRUE),
Total_Amount_Last_Order = sample(20:50, size = n, replace = TRUE))
df4 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Interested_In_Promo = sample(c(TRUE, FALSE), size = n*2/3, replace = TRUE))
We use the microbenchmark package to assess the computational speed. First we put our code in functions as input for microbenchmark.
m1 <- function(df1, df2, df3, df4){
df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge
}
m2 <- function(df1, df2, df3, df4){
df_list <- list(df1, df2, df3, df4)
Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE,
all.y = FALSE), df_list)
}
library("microbenchmark")
mbm <- microbenchmark(
mergeBase = m1(df1, df2, df3, df4),
Reduce = m2(df1, df2, df3, df4),
times = 10
)
mbm
## Unit: seconds
## expr min lq mean median uq max neval cld
## mergeBase 1.028722 1.120588 1.126353 1.131829 1.152066 1.170740 10 a
## Reduce 1.105373 1.144046 1.168305 1.158710 1.179812 1.306395 10 a
About 1.15 seconds for the complete sequence of merges. Not bad, but the computation times grow quickly as the data set gets larger. A quick way to check this is sketched below.
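To get a feeling for how the base R solution scales, one rough option (our own sketch, not part of the original post; gen_data is a hypothetical helper) is to wrap the data generation in a function and time m1() for growing table sizes:
# Hypothetical helper: regenerate the four example tables for a given n
gen_data <- function(n){
  list(df1 = data.frame(Customer_ID = 1:n,
                        Name = rep(c("John", "Sue", "Ann"), n/3)),
       df2 = data.frame(Customer_ID = sample(1:n, size = n*2/3),
                        Year_First_Order = sample(2000:2017, size = n*2/3, replace = TRUE)),
       df3 = data.frame(Customer_ID = 1:n,
                        Date_Last_Order = rep(c("2017-03-03", "2014-03-01", "2017-05-30"), n/3),
                        No_Items_Last_Order = sample(1:10, size = n, replace = TRUE),
                        Total_Amount_Last_Order = sample(20:50, size = n, replace = TRUE)),
       df4 = data.frame(Customer_ID = sample(1:n, size = n*2/3),
                        Interested_In_Promo = sample(c(TRUE, FALSE), size = n*2/3, replace = TRUE)))
}
# time the three successive merges for increasing table sizes
for (nn in c(300000, 600000, 1200000)) {
  d <- gen_data(nn)
  print(system.time(m1(d$df1, d$df2, d$df3, d$df4)))
}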
data.table’s solution to the above problem
It is natural to ask whether more efficient code can be written that provides the same results, and whether a more readable syntax can be used than above.
Let's replace our merge() and Reduce() solutions with data.table code.
We first load the package and store the data frames as data tables. In addition, we set a key on each table so that data.table can use its fast binary search.
library("data.table")
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt3 <- data.table(df3)
dt4 <- data.table(df4)
setkey(dt1, Customer_ID)
setkey(dt2, Customer_ID)
setkey(dt3, Customer_ID)
setkey(dt4, Customer_ID)
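As a quick check (not in the original post), you can verify that the keys have been set; key() returns the key columns of a single table and tables() gives an overview of all data.tables in memory:
key(dt1)     # returns "Customer_ID"
tables()     # lists all data.tables with their dimensions and keys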
The following data.table code gives us the same solution as our previous base R code (apart from the column order), but note that the syntax is much shorter and more user-friendly! This is all you need to write for the chain of LEFT OUTER joins for the given problem:
DT_cmb <- dt4[dt3[dt2[dt1]]]
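As a side note (not part of the original post): if you prefer not to set keys beforehand, newer versions of data.table also support ad-hoc joins via the on argument; DT_cmb2 is our own name for the result of the same chain of left joins:
DT_cmb2 <- dt2[dt1, on = "Customer_ID"]       # left join of dt1 with dt2
DT_cmb2 <- dt3[DT_cmb2, on = "Customer_ID"]   # add the columns of dt3
DT_cmb2 <- dt4[DT_cmb2, on = "Customer_ID"]   # add the columns of dt4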
Let's compare the speed of all three approaches (pure merge(), Reduce() and data.table).
m3 <- function(dt1, dt2, dt3, dt4){
setkey(dt1, Customer_ID)
setkey(dt2, Customer_ID)
setkey(dt3, Customer_ID)
setkey(dt4, Customer_ID)
DT_cmb <- dt4[dt3[dt2[dt1]]]
DT_cmb
}
library("microbenchmark")
mbm <- microbenchmark(
mergeBase = m1(df1, df2, df3, df4),
Reduce = m2(df1, df2, df3, df4),
dt = m3(dt1, dt2, dt3, dt4),
times = 10
)
mbm
## Unit: milliseconds
## expr min lq mean median uq max
## mergeBase 1010.03602 1030.72226 1099.73585 1047.30145 1208.32011 1286.489
## Reduce 968.37101 1039.04347 1097.75323 1048.05562 1201.69135 1284.099
## dt 43.24658 50.42165 66.52315 53.34681 70.50837 140.710
## neval cld
## 10 b
## 10 b
## 10 a
The solution with data.table is more than 15 times faster, and the code is much shorter and more elegant. Note that the difference becomes even (much) larger as the data size increases.
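If you want to look at the full distribution of the timings rather than the summary table, microbenchmark results can be plotted directly (ggplot2 is only needed for the autoplot() variant):
boxplot(mbm)                          # base graphics boxplot of the timings
# library("ggplot2"); autoplot(mbm)   # alternative ggplot2 visualisation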
Example: Sums of matrix powers
The lapply()/Reduce() code:
We can also combine lapply() and Reduce(), as is done in the following example.
library("expm")
n <- 2
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
p1 <- function(vec, P){
P_powers <- lapply(vec, function(k) P %^% k)
Reduce("+", P_powers)
}
p1(0:2, P = P)
## [,1] [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824
What is the aim of this code? It is not straightforward to see. First, different powers of the matrix P are computed, then the resulting matrices are summed up. We can ask whether this can be done with more intuitive and user-friendly code (just a simple for loop), whether we can gain computational speed, and whether we can avoid storing the whole list of matrix powers as above.
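To make the intent explicit: for vec = 0:2 the code computes P^0 + P^1 + P^2, which can be written out by hand as a quick check of our own (not in the original post):
diag(nrow(P)) + P + P %*% P    # P^0 + P^1 + P^2, should match p1(0:2, P = P)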
General note: since R version 3.0.0, for loops are approximately as fast as the apply family (at least for typical problems). Multiple uses of the apply family result in code that is hard to read, especially when several apply statements appear in one line of code, and debugging becomes more difficult. This is to some extent also the case with the solution above, where lapply() and Reduce() are used.
The simple for() loop for the problem:
The following loop simply updates a matrix by adding the next matrix power to the current state.
p2 <- function(vec, P){
P2 <- P %^% vec[1]
for(i in vec[-1]){
P2 <- P2 + P %^% i
}
P2
}
p2(0:2, P = P)
## [,1] [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824
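A quick programmatic check of our own that both implementations agree:
all.equal(p1(0:2, P = P), p2(0:2, P = P))    # should return TRUE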
This indeed gives the same result. Let's increase the dimension of P and the number of powers applied.
n <- 10
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
mbm <- microbenchmark(
Reduce = p1(seq_len(10000), P = P),
Simple = p2(seq_len(10000), P = P),
times = 10
)
mbm
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## Reduce 122.9000 124.2832 131.1854 133.8611 135.4301 136.7512 10 b
## Simple 109.7845 111.0879 113.7115 112.0759 113.4696 124.5798 10 a
The speeds are comparable, with a slight advantage for the simple for loop; in addition, the loop code is (much) easier to read, and debugging is easier with a for loop.
Note that an additional gain in computational speed can be obtained by using C++ within R, e.g. via package Rcpp. However, in this case this is not straightforward: either the functionality of package expm must be called from within the Rcpp function, or the powering of a matrix must first be implemented in C++. A sketch of the latter route is given below.
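The following is a minimal sketch of that second route (our own untested example, not from the original post): instead of calling expm's %^% from C++, the matrix powers are built up by repeated multiplication using RcppArmadillo; the function name sum_powers_cpp is our own choice.
library("Rcpp")
# Sum of matrix powers P^0 + P^1 + ... + P^kmax, implemented in C++.
# Requires the RcppArmadillo package; powers are built up incrementally
# by repeated matrix multiplication instead of calling expm's %^%.
cppFunction(depends = "RcppArmadillo", code = '
arma::mat sum_powers_cpp(const arma::mat& P, const int kmax) {
  arma::mat acc = arma::eye<arma::mat>(P.n_rows, P.n_cols);  // P^0
  arma::mat pow = acc;
  for (int k = 1; k <= kmax; ++k) {
    pow = pow * P;        // P^k
    acc = acc + pow;
  }
  return acc;
}')
sum_powers_cpp(P, 2)    # should match p1(0:2, P = P)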
Further reading
Package data.table is incredibly efficient in terms of computational speed. Use it whenever you deal with large (rectangular) data sets. Read its vignettes, tutorials and the package manual. Note that package dplyr – not used here – is also fast and very user-friendly; a sketch of the join chain with dplyr is given below.
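For completeness, here is a sketch (our own, not benchmarked here) of the same chain of left joins written with dplyr:
library("dplyr")
df_dplyr <- df1 %>%
  left_join(df2, by = "Customer_ID") %>%
  left_join(df3, by = "Customer_ID") %>%
  left_join(df4, by = "Customer_ID")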