When A Blind Shot Does Not Hit

In this article, I recount measures and approaches used to deal with a relatively small data set that nevertheless has to be covered “perfectly”. In current academic research and large-scale industrial applications, datasets contain millions to billions (or even more) of documents. While this burdens implementers with considerations of scale at the level of infrastructure, it may make matching comparatively easy: if users are content with a few high-quality results, good retrieval effectiveness is simple to attain. Larger datasets are more likely to contain any requested information, linguistically encoded in many different ways, i.e., using different spellings, sentences, grammar, languages, etc.: a “blind shot” will hit one of many targets.

However, there are business domains in which the number of main entities will never reach the billions, which inherently limits the document pool. We have recently added another successfully finished CTI-funded project to our track record, which was set in such a business domain. Dubbed “Stiftungsregister 2.0”[1], the project aimed to create an application that enables users to search all foundations in existence in Switzerland.



Big Data Query Processing with Mixed Workloads

As part of a recent project called Big Data Query Processing, we evaluated complex query workloads on modern Big Data systems. In particular, we benchmarked Cloudera Impala using a business intelligence use case provided by an industry partner. The results can be found in the following blog post hosted by Cloudera:

How Impala Supports Mixed Workloads in Multi-User Environments