R: Reduce() Part 2 – some pitfalls using Reduce

By Matthias Templ (ZHAW), Thoralf Mildenberger (ZHAW)

By way of example, functionality of Reduce() is shown in in https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/ . It’s great to learn about how to use this function on interesting problems. If you are ready (equals if you read the first blog post on Reduce), we want to push you further on writing efficient code.

Reduce() and other base R functionality might be extremely slow compared to data manipulation functions of package dplyr or data.table . In addition, compared to other software and languages, solving recursive problems with R’s Reduce() is super-slow, see e.g. recursive problems at Rosetta Code , and this is sometimes used to argue that R is slow itself, which is simply not true.

This blog post discusses some drawbacks of Reduce and base R related to user-friendliness of code and computational speed. We re-discuss two examples from https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/

Example: Merging multiple data sets

Joining various data sets can be done with many packages, but to the best of my knowledge this could be done with package data.table . The syntax is clear and the computational speed is just amazing.

The former merge() and Reduce() solutions with base R:

df1 <- data.frame(Customer_ID = c(1, 2, 3), Name = c("John", "Sue", "Ann"))
df2 <- data.frame(Customer_ID = c(1, 3), Year_First_Order = c(2011, 2017))
df3 <- data.frame(Customer_ID = c(1, 2, 3), 
                  Date_Last_Order = c("2017-03-03", "2014-03-01", "2017-05-30"),
                  No_Items_Last_Order = c(3, 1, 1),
                  Total_Amount_Last_Order = c(49, 25,25))
df4 <- data.frame(Customer_ID = c(2, 3), Interested_In_Promo = c(TRUE, FALSE))

We could first join the first two data frames, then join the result with the third, and then join the results of that with the fourth table. In the following, a LEFT OUTER join is done using base R:

df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge
##   Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1           1 John             2011      2017-03-03                   3
## 2           2  Sue               NA      2014-03-01                   1
## 3           3  Ann             2017      2017-05-30                   1
##   Total_Amount_Last_Order Interested_In_Promo
## 1                      49                  NA
## 2                      25                TRUE
## 3                      25               FALSE

When all the data frames are stored in one list, Reduce() can be applied in the following manner:

df_list <- list(df1, df2, df3, df4)
Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE, all.y = FALSE), df_list)
##   Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1           1 John             2011      2017-03-03                   3
## 2           2  Sue               NA      2014-03-01                   1
## 3           3  Ann             2017      2017-05-30                   1
##   Total_Amount_Last_Order Interested_In_Promo
## 1                      49                  NA
## 2                      25                TRUE
## 3                      25               FALSE

But is this computationally efficient?

We need a larger data set to get some insights.

n <- 300000
df1 <- data.frame(Customer_ID = 1:n, Name = rep(c("John", "Sue", "Ann"), n/3))
df2 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Year_First_Order = sample(2000:2017, size = n*2/3, replace = TRUE))
df3 <- data.frame(Customer_ID = 1:n, 
                  Date_Last_Order = rep(c("2017-03-03", "2014-03-01", "2017-05-30"),n/3),
                  No_Items_Last_Order = sample(1:10, size = n, replace = TRUE),
                  Total_Amount_Last_Order = sample(20:50, size = n, replace = TRUE))
df4 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Interested_In_Promo = sample(c(TRUE, FALSE), size = n*2/3, replace = TRUE))

We use the microbenchmark package to assess the computational speed. First we put our code in functions as input for microbenchmark.

m1 <- function(df1, df2, df3, df4){
  df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
  df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
  df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
  df_merge
}
m2 <- function(df1, df2, df3, df4){
  df_list <- list(df1, df2, df3, df4)
  Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE, 
                                all.y = FALSE), df_list)
}
library("microbenchmark")
mbm <- microbenchmark(
  mergeBase = m1(df1, df2, df3, df4),
  Reduce = m2(df1, df2, df3, df4),
  times = 10
)
mbm
## Unit: seconds
##       expr      min       lq     mean   median       uq      max neval cld
##  mergeBase 1.028722 1.120588 1.126353 1.131829 1.152066 1.170740    10   a
##     Reduce 1.105373 1.144046 1.168305 1.158710 1.179812 1.306395    10   a

About 1.15 seconds for one merge. Not bad, but if we would increase the size of our data set slightly, the computation times would increase dramatically.

data.table’s solution to the above problem

It is relevant to ask, if more efficient code can be written that provides the same results. In addition, we ask the question if we can use better readable syntax than above?

Let’s replace our merge and Reduce solutions with data.table code.

We first call the package and store the data frames as data tables. In addition, we set a key for an efficient use of the binary search of the data.table package.

library("data.table")
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt3 <- data.table(df3)
dt4 <- data.table(df4)
setkey(dt1, Customer_ID)
setkey(dt2, Customer_ID) 
setkey(dt3, Customer_ID) 
setkey(dt4, Customer_ID) 

The following data.table code gives us exactly the same solution as with our previous base R code.

But note, the syntax is much shorter and more user-friendly! This is all you need to write for the LEFT INNER JOIN for the given problem:

DT_cmb <- dt4[dt3][dt2][dt1]

Lets compare the speed of all three approaches (pure merge, Reduce and data.table).

m3 <- function(dt1, dt2, dt3, dt4){
  setkey(dt1, Customer_ID)
  setkey(dt2, Customer_ID) 
  setkey(dt3, Customer_ID) 
  setkey(dt4, Customer_ID) 
  DT_cmb <- dt4[dt3][dt2][dt1]
  DT_cmb
}
library("microbenchmark")
mbm <- microbenchmark(
  mergeBase = m1(df1, df2, df3, df4),
  Reduce = m2(df1, df2, df3, df4),
  dt = m3(dt1, dt2, dt3, dt4),
  times = 10
)
mbm
## Unit: milliseconds
##       expr        min         lq       mean     median         uq      max
##  mergeBase 1010.03602 1030.72226 1099.73585 1047.30145 1208.32011 1286.489
##     Reduce  968.37101 1039.04347 1097.75323 1048.05562 1201.69135 1284.099
##         dt   43.24658   50.42165   66.52315   53.34681   70.50837  140.710
##  neval cld
##     10   b
##     10   b
##     10  a

The solution with data.table is more than 15 times faster! and the code is much shorter and elegant. Note that the differences becomes even (much) larger if the data size increases.

Example: Sums of matrix powers

The lapply/Reduce code:

We can also combine lapply() and Reduce(), such as been carried out in the following example.

library("expm")
n <- 2
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
p1 <- function(vec, P){
  P_powers <- lapply(vec, function(k) P %^% k)
  Reduce("+", P_powers)
}
p1(0:2, P = P)
##            [,1]       [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824

What is the aim of this code? It not straightforward to see this. First different powers of matrix P are applied, then the cumulative sum of resulting matrices is calculated. We can ask if we can do it with a more intuitive and more user-friendly code (just by using a simple for loop), to gain computational speed and to not store the whole list of matrix powers as above.

General note: from R version 3.0.0 for loops are approx. equally fast than using the apply family (at least for usual problems). Multiple use of the apply family results in hard to read code, especially when several apply statements are given in one line of code, and debugging of code is more difficult.

This is also to some extend the case with the solution above, where lapply() and Reduce() are used.

The simple for() loop to the problem:

The following loop just updates a matrix by adding the previous state of the matrix with the powering of the matrix.

p2 <- function(vec, P){
  P2 <- P %^% vec[1]
  for(i in vec[-1]){
    P2 <- P2 + P %^% i
  } 
  P2
}
p2(0:2, P = P)
##            [,1]       [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824

This truly gives the same result. Let’s increase the dimension of P and the number of powers applied.

n <- 10
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
mbm <- microbenchmark(
  Reduce = p1(seq_len(10000), P = P),
  Simple = p2(seq_len(10000), P = P),
  times = 10
)
mbm
## Unit: milliseconds
##    expr      min       lq     mean   median       uq      max neval cld
##  Reduce 122.9000 124.2832 131.1854 133.8611 135.4301 136.7512    10   b
##  Simple 109.7845 111.0879 113.7115 112.0759 113.4696 124.5798    10  a

The speed is comparable but slightly faster for the simple for loop, naturally the code is (much) better readable and debugging of code would be easier when using a for loop.

Note that additional gain in computational speed can be obtained using C++ within R. This can be done, e.g. using package Rcpp . However, in this case this is not straightforward to apply and either the environment of package expm must be imported within a Rcpp function or a powering of a matrix method must be first implemented in C++.

Further reading

Package data.table is incredibly efficient in terms of computational speed. Use it whenever dealing with large (rectangular) data sets. Read its vignettes, tutorials and package manual. Note that – not used here – also package dplyr is fast and very user-friendly.

R: Reduce() – apply’s lesser known brother

By Thoralf Mildenberger (ZHAW)

Everybody who knows a bit about R knows that in general loops are said to be evil and should be avoided, both for efficiency reasons and code readability, although one could argue about both.

The usual advice is to use vector operations and apply() and its relatives. sapply(), vapply() and lapply() work by applying a function on each element of a vector or list and return a vector, matrix, array or list of the results. apply() applies a function on one of the dimensions of a matrix or array and returns a vector, matrix or array. These are very useful, but they only work if the function to be applied to the data can be applied to each element independently of each other.

There are cases, however, where we would still use a for loop because the result of applying our operation to an element of the list depends on the results for the previous elements. The R base package provides a function Reduce(), which can come in handy here. Of course it is inspired by functional programming, and actually does something similar to the Reduce step in MapReduce, although it is not inteded for big data applications. Since it seems to be little known even to long-time R users, we will look at a few examples in this post. Continue reading