R: Reduce() Part 2 – some pitfalls using Reduce

By Matthias Templ (ZHAW), Thoralf Mildenberger (ZHAW)

By way of example, the functionality of Reduce() is shown in https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/ . It is great to learn how to use this function on interesting problems. If you are ready (i.e., if you have read that first blog post on Reduce), we want to push you further towards writing efficient code.

Reduce() and other base R functionality can be extremely slow compared to the data manipulation functions of the dplyr or data.table packages. In addition, compared to other software and languages, solving recursive problems with R's Reduce() is very slow (see e.g. the recursive problems at Rosetta Code), and this is sometimes used to argue that R itself is slow, which is simply not true.
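
To give a flavour of the kind of recursion task meant here (a small sketch of ours, not code taken from Rosetta Code), Fibonacci numbers can be computed by folding a state vector with Reduce():

# Sketch: iterate the Fibonacci recurrence by folding the state vector
# (F_{k-1}, F_k) with Reduce(); illustrative only.
fib <- function(n) {
  Reduce(function(state, k) c(state[2], state[1] + state[2]),
         seq_len(n), init = c(0, 1))[1]
}
sapply(1:10, fib)  # 1 1 2 3 5 8 13 21 34 55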

This blog post discusses some drawbacks of Reduce() and base R related to the user-friendliness of code and to computational speed. We revisit two examples from https://blog.zhaw.ch/datascience/r-reduce-applys-lesser-known-brother/ .

Example: Merging multiple data sets

Joining several data sets can be done with many packages, but to the best of my knowledge it is best done with the data.table package. The syntax is clear and the computational speed is simply amazing.

First, the merge() and Reduce() solutions from the previous post, using base R:

df1 <- data.frame(Customer_ID = c(1, 2, 3), Name = c("John", "Sue", "Ann"))
df2 <- data.frame(Customer_ID = c(1, 3), Year_First_Order = c(2011, 2017))
df3 <- data.frame(Customer_ID = c(1, 2, 3), 
                  Date_Last_Order = c("2017-03-03", "2014-03-01", "2017-05-30"),
                  No_Items_Last_Order = c(3, 1, 1),
                  Total_Amount_Last_Order = c(49, 25, 25))
df4 <- data.frame(Customer_ID = c(2, 3), Interested_In_Promo = c(TRUE, FALSE))

We could first join the first two data frames, then join the result with the third, and then join that result with the fourth table. In the following, a LEFT OUTER join is done using base R:

df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
df_merge
##   Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1           1 John             2011      2017-03-03                   3
## 2           2  Sue               NA      2014-03-01                   1
## 3           3  Ann             2017      2017-05-30                   1
##   Total_Amount_Last_Order Interested_In_Promo
## 1                      49                  NA
## 2                      25                TRUE
## 3                      25               FALSE

When all the data frames are stored in one list, Reduce() can be applied in the following manner:

df_list <- list(df1, df2, df3, df4)
Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE, all.y = FALSE), df_list)
##   Customer_ID Name Year_First_Order Date_Last_Order No_Items_Last_Order
## 1           1 John             2011      2017-03-03                   3
## 2           2  Sue               NA      2014-03-01                   1
## 3           3  Ann             2017      2017-05-30                   1
##   Total_Amount_Last_Order Interested_In_Promo
## 1                      49                  NA
## 2                      25                TRUE
## 3                      25               FALSE

But is this computationally efficient?

We need a larger data set to get some insights.

n <- 300000
df1 <- data.frame(Customer_ID = 1:n, Name = rep(c("John", "Sue", "Ann"), n/3))
df2 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Year_First_Order = sample(2000:2017, size = n*2/3, replace = TRUE))
df3 <- data.frame(Customer_ID = 1:n, 
                  Date_Last_Order = rep(c("2017-03-03", "2014-03-01", "2017-05-30"),n/3),
                  No_Items_Last_Order = sample(1:10, size = n, replace = TRUE),
                  Total_Amount_Last_Order = sample(20:50, size = n, replace = TRUE))
df4 <- data.frame(Customer_ID = sample(1:n, size = n*2/3), Interested_In_Promo = sample(c(TRUE, FALSE), size = n*2/3, replace = TRUE))

We use the microbenchmark package to assess computational speed. First, we wrap our code in functions to use as input for microbenchmark().

m1 <- function(df1, df2, df3, df4){
  df_merge <- merge(df1, df2, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
  df_merge <- merge(df_merge, df3, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
  df_merge <- merge(df_merge, df4, by = "Customer_ID", all.x = TRUE, all.y = FALSE)
  df_merge
}
m2 <- function(df1, df2, df3, df4){
  df_list <- list(df1, df2, df3, df4)
  Reduce(function(d1, d2) merge(d1, d2, by = "Customer_ID", all.x = TRUE, 
                                all.y = FALSE), df_list)
}
library("microbenchmark")
mbm <- microbenchmark(
  mergeBase = m1(df1, df2, df3, df4),
  Reduce = m2(df1, df2, df3, df4),
  times = 10
)
mbm
## Unit: seconds
##       expr      min       lq     mean   median       uq      max neval cld
##  mergeBase 1.028722 1.120588 1.126353 1.131829 1.152066 1.170740    10   a
##     Reduce 1.105373 1.144046 1.168305 1.158710 1.179812 1.306395    10   a

About 1.15 seconds for one run of the merges. Not bad, but if we increased the size of our data set only slightly, the computation time would increase dramatically.

data.table’s solution to the above problem

It is fair to ask whether more efficient code can be written that gives the same results, and whether we can use more readable syntax than above.

Let’s replace our merge and Reduce solutions with data.table code.

We first load the package and store the data frames as data tables. In addition, we set a key on each table so that data.table can make efficient use of its binary search.

library("data.table")
dt1 <- data.table(df1)
dt2 <- data.table(df2)
dt3 <- data.table(df3)
dt4 <- data.table(df4)
setkey(dt1, Customer_ID)
setkey(dt2, Customer_ID) 
setkey(dt3, Customer_ID) 
setkey(dt4, Customer_ID) 

The following data.table code gives us exactly the same solution as our previous base R code.

But note, the syntax is much shorter and more user-friendly! This is all you need to write for the LEFT OUTER join for the given problem:

DT_cmb <- dt4[dt3][dt2][dt1]
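
Equivalently – a small sketch of ours, assuming a reasonably recent data.table version – the join column can be passed via the on argument, so that no keys need to be set beforehand:

# Sketch: the same left outer join without keys, passing the join column via `on`.
DT_cmb2 <- dt4[dt3, on = "Customer_ID"][dt2, on = "Customer_ID"][dt1, on = "Customer_ID"]
all.equal(as.data.frame(DT_cmb), as.data.frame(DT_cmb2))  # should be TRUE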

Let's compare the speed of all three approaches (pure merge(), Reduce() and data.table).

m3 <- function(dt1, dt2, dt3, dt4){
  setkey(dt1, Customer_ID)
  setkey(dt2, Customer_ID) 
  setkey(dt3, Customer_ID) 
  setkey(dt4, Customer_ID) 
  DT_cmb <- dt4[dt3][dt2][dt1]
  DT_cmb
}
library("microbenchmark")
mbm <- microbenchmark(
  mergeBase = m1(df1, df2, df3, df4),
  Reduce = m2(df1, df2, df3, df4),
  dt = m3(dt1, dt2, dt3, dt4),
  times = 10
)
mbm
## Unit: milliseconds
##       expr        min         lq       mean     median         uq      max
##  mergeBase 1010.03602 1030.72226 1099.73585 1047.30145 1208.32011 1286.489
##     Reduce  968.37101 1039.04347 1097.75323 1048.05562 1201.69135 1284.099
##         dt   43.24658   50.42165   66.52315   53.34681   70.50837  140.710
##  neval cld
##     10   b
##     10   b
##     10  a

The solution with data.table is more than 15 times faster, and the code is much shorter and more elegant. Note that the difference becomes even (much) larger as the data size increases.

Example: Sums of matrix powers

The lapply/Reduce code:

We can also combine lapply() and Reduce(), as is done in the following example.

library("expm")
n <- 2
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
p1 <- function(vec, P){
  P_powers <- lapply(vec, function(k) P %^% k)
  Reduce("+", P_powers)
}
p1(0:2, P = P)
##            [,1]       [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824

What is the aim of this code? It is not straightforward to see. First, different powers of the matrix P are computed, then the resulting matrices are summed up. We can ask whether we can do this with more intuitive and user-friendly code (just a simple for loop), gain computational speed, and avoid storing the whole list of matrix powers as above.
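
Spelled out for vec = 0:2 (a quick check of ours, not from the original post), p1() simply returns the matrix sum P^0 + P^1 + P^2:

# P^0 is the identity matrix, P^1 = P, P^2 = P %*% P
all.equal(p1(0:2, P = P), diag(nrow(P)) + P + P %*% P)  # TRUE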

General note: since R version 3.0.0, for loops are approximately as fast as the apply family (at least for typical problems). Heavy use of the apply family results in hard-to-read code, especially when several apply statements are nested in one line, and it makes debugging more difficult.

This is also, to some extent, the case with the solution above, where lapply() and Reduce() are used.
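
As a rough illustration of the first point (a quick sketch of ours, not a rigorous benchmark), a plain for loop and sapply() computing the same quantity are usually in the same ballpark:

# Compare a simple for loop with sapply() on an element-wise task.
f_loop  <- function(n){ x <- numeric(n); for(i in seq_len(n)) x[i] <- sqrt(i); x }
f_apply <- function(n) sapply(seq_len(n), sqrt)
microbenchmark(loop = f_loop(1e5), apply = f_apply(1e5), times = 10)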

The simple for() loop to the problem:

The following loop simply updates the matrix by adding the next matrix power to the previous state.

p2 <- function(vec, P){
  P2 <- P %^% vec[1]
  for(i in vec[-1]){
    P2 <- P2 + P %^% i
  } 
  P2
}
p2(0:2, P = P)
##            [,1]       [,2]
## [1,] 1.01718756 0.01206923
## [2,] 0.01409565 1.01037824

This indeed gives the same result. Let's increase the dimension of P and the number of powers applied.

n <- 10
P <- matrix(runif(n*n, 0.01, 0.02), ncol = n, nrow = n)
mbm <- microbenchmark(
  Reduce = p1(seq_len(10000), P = P),
  Simple = p2(seq_len(10000), P = P),
  times = 10
)
mbm
## Unit: milliseconds
##    expr      min       lq     mean   median       uq      max neval cld
##  Reduce 122.9000 124.2832 131.1854 133.8611 135.4301 136.7512    10   b
##  Simple 109.7845 111.0879 113.7115 112.0759 113.4696 124.5798    10  a

The speed is comparable, though slightly better for the simple for loop; naturally, the code is (much) more readable, and debugging is easier when using a for loop.

Note that an additional gain in computational speed can be obtained by using C++ within R, e.g. via the Rcpp package. However, in this case this is not straightforward: either the environment of the expm package must be imported within an Rcpp function, or the powering of a matrix must first be implemented in C++.
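
For illustration, here is a minimal sketch of the second option, assuming the Rcpp and RcppArmadillo packages are installed. sum_powers() is our own hypothetical helper (not part of expm) and computes P^0 + P^1 + ... + P^kmax by repeated multiplication:

library("Rcpp")
cppFunction(depends = "RcppArmadillo", code = '
arma::mat sum_powers(const arma::mat& P, const int kmax) {
  arma::mat acc = arma::eye<arma::mat>(P.n_rows, P.n_cols);  // running sum, starts at P^0
  arma::mat Pk  = acc;                                       // current power P^k, starts at P^0
  for (int k = 1; k <= kmax; k++) {
    Pk  = Pk * P;    // P^k by repeated multiplication
    acc = acc + Pk;  // add it to the running sum
  }
  return acc;
}')
all.equal(p2(0:10, P = P), sum_powers(P, 10))  # should be TRUE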

Further reading

Package data.table is incredibly efficient in terms of computational speed. Use it whenever you deal with large (rectangular) data sets. Read its vignettes, tutorials and package manual. Note that package dplyr – not used here – is also fast and very user-friendly.
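
For completeness, a small sketch of ours (not benchmarked in this post) of how the same chain of left joins looks with dplyr:

library("dplyr")
# left_join() keeps all rows of the left table, like merge(..., all.x = TRUE).
df_all <- Reduce(function(d1, d2) left_join(d1, d2, by = "Customer_ID"),
                 list(df1, df2, df3, df4))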

R: Reduce() – apply’s lesser known brother

By Thoralf Mildenberger (ZHAW)

Everybody who knows a bit about R knows that loops are generally said to be evil and should be avoided, both for efficiency reasons and for code readability, although one could argue about both.

The usual advice is to use vector operations and apply() and its relatives. sapply(), vapply() and lapply() work by applying a function on each element of a vector or list and return a vector, matrix, array or list of the results. apply() applies a function on one of the dimensions of a matrix or array and returns a vector, matrix or array. These are very useful, but they only work if the function to be applied to the data can be applied to each element independently of each other.
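
A quick illustration of ours (not part of the original post):

sapply(1:3, function(x) x^2)          # returns the vector 1 4 9
lapply(1:3, function(x) x^2)          # returns the same results as a list
apply(matrix(1:6, nrow = 2), 1, sum)  # row sums of a 2 x 3 matrix: 9 12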

There are cases, however, where we would still use a for loop because the result of applying our operation to an element of the list depends on the results for the previous elements. The R base package provides a function Reduce(), which can come in handy here. Of course it is inspired by functional programming, and actually does something similar to the Reduce step in MapReduce, although it is not intended for big data applications. Since it seems to be little known even to long-time R users, we will look at a few examples in this post. Continue reading

Conference review of SDS|2017, Kursaal Bern, June 16

by Thilo Stadelmann (ZHAW)

In 2014, ZHAW Datalab started the SDS conference series. It was the year with only one Swiss data scientist identifiable on LinkedIn (at Postfinance…). The year where we talked about “Big Data”, and not “Digitization”. The year where we were unsure if such a thing as a Swiss data science community would exist, and if it actually would come to such an event.

SDS grew from a local workshop to a conference with over 200 participants and international experts as keynote speakers in 2016. This was the year where finally a Swiss-wide network of strong partners from academia and industry emerged to push innovation in data-driven value creation: the Swiss Alliance for Data-Intensive Services (www.data-service-alliance.ch). We as datalabbers have been instrumental in founding this alliance, and then found it to be the perfect partner to take this event to the next level of professionalism.

Continue reading

Who is scared of the big bad robot?

By Lukas Tuggener (ZHAW)
Twitter: https://twitter.com/ltuggener/

Reposted from https://medium.com/@ltuggener/who-is-scared-of-the-big-bad-robot-dc203e2cd7c4

There is currently much talk about the recent developments in AI and how they are going to affect the way we live our lives. While fears of superintelligent robots that want to end all human life on earth are mostly held by laymen, there are other concerns that are also very common amongst insiders of the field. Most AI experts agree that the short- to mid-term impact of AI developments will mostly revolve around automating complex tasks rather than around "artificially intelligent beings". The well-known AI researcher Andrew Ng put it this way: Continue reading

OpenAI Gym environment for Modelica models

By Gabriel Eyyi (ZHAW)

In this blog post I will show how to combine dynamic models from Modelica with reinforcement learning.

As part of one of my master's projects, a software environment was developed to examine reinforcement learning algorithms on existing dynamic models from Modelica in order to solve control tasks. Modelica is a non-proprietary, object-oriented, equation-based language for conveniently modelling complex physical systems [1].

The result is the Python library Dymola Reinforcement Learning (dymrl), which allows you to explore reinforcement learning algorithms for dynamical systems.

The code of this project can be found on GitHub.

Continue reading

Review of SDS|2016

By Amrita Prasad (ZHAW)

It’s already been a month since we met as the Swiss Data Science community at our 3rd Swiss Conference on Data Science (SDS|2016), pushed again by ZHAW’s Datalab group and presented by SAP Switzerland.

Several additional organisations sponsored and supported the conference to make it a success – the organising committee thanks IT Logix & Microsoft, PwC, Google, Zühlke, SGAICO, Hasler Stiftung and the Swiss Alliance for Data-Intensive Services for their support in bringing together a successful event! Continue reading

Data scientists type A & B? Nonsense.

By Thilo Stadelmann (ZHAW)

Reposted from https://dublin.zhaw.ch/~stdm/?p=350#more-350

I recently came across the notion of "type A" and "type B" data scientists. While "type A" is basically a trained statistician who has broadened his field towards modern use cases ("data science for people"), the same is true for "type B" (B for "build", "data science for software"), who has his roots in programming and contributes more strongly to code and systems in the backend.

Frankly, I haven't come across a practically more useless distinction since the inception of the term "data science". Data science is the name for a new discipline that is in itself interdisciplinary [see e.g. here – but beware of German text]. The whole point of interdisciplinarity, and by extension of data science, is for proponents to think outside the box of their original discipline (which might be statistics, computer science, physics, economics or something completely different) and to acquire skills in the neighboring disciplines in order to tackle problems outside of intellectual silos. Encouraging practitioners to stay in their silos, as this A/B typology suggests, is counterproductive at best, fatal at worst. Continue reading

Deep learning for lazybones

By Oliver Dürr (ZHAW)

Reposted from http://oduerr.github.io/blog/2016/04/06/Deep-Learning_for_lazybones

In this blog I explore the possibility to use a trained CNN on one image dataset (ILSVRC) as feature extractor for another image dataset (CIFAR-10). The code using TensorFlow can be found at github. Continue reading

SDS|2015 Conference Review

The Swiss Data Science community recently met at SDS|2015, the 2nd Swiss Workshop on Data Science. It was a full day event organized by ZHAW Datalab, with inspiring talks, hands-on data expeditions, and an excellent provision of space and atmosphere for fruitful networking. The conference took place on the 12th of June at the premises of ZHAW in Winterthur. It attracted people with a wide range of skills, expertise, and levels from doers to managers, and had very strong support from industry, hence showing the huge potential and scope of the subject.

Review

After the workshop was kicked off by the President of ZHAW, Dr. Jan-Egbert Sturm, Professor of Macroeconomics and Director of the KOF Swiss Economic Institute at ETH, gave an insightful keynote talk on the use of ever-increasing datasets in macroeconomic forecasting. He explained to the audience how to do economic forecasting using simple and standard analytical techniques. It was especially interesting for data analytics experts to see a methodology that successfully uses down-to-earth analytical techniques integrated with in-depth knowledge of economics. Continue reading

When A Blind Shot Does Not Hit

In this article, I recount measures and approaches used to deal with a relatively small data set that, in turn, has to be covered "perfectly". In current academic research and large-scale industrial applications, datasets contain millions to billions (or even more) of documents. While this burdens implementers with considerations of scale at the level of infrastructure, it may make matching comparatively easy: if users are content with a few high-quality results, good retrieval effectiveness is simple to attain. Larger datasets are more likely to contain any requested information, linguistically encoded in many different ways, i.e., using different spellings, sentences, grammar, languages, etc.: a "blind shot" will hit a target (one of many).

However, there are business domains whose main entities will never number in the billions, which inherently limits the document pool. We recently added another successfully completed CTI-funded project to our track record, which dealt with such a business domain. Dubbed "Stiftungsregister 2.0" [1], the aim of the project was to create an application that enables users to search for all foundations in existence in Switzerland.

Continue reading