In this article, I recount measures and approaches used to deal with a relatively small data set that, in turn, has to be covered “perfectly”. In current academic research and large-scale industrial applications, datasets contain millions to billions (or even more) of documents. While this burdens implementers with considerations of scale at the level of infrastructure, it may make matching comparatively easy: if users are content with a few high quality results, good retrieval effectiveness is simple to attain. Larger datasets are more likely to contain any requested information, linguistically encoded in many different ways, i.e., using different spellings, sentences, grammar, languages, etc.: a “blind shot” will hit a (one of many) target.
However, there are business domains whose main entities’ count will never reach the billions, therefore inherently limiting the document pool. We have recently added another successfully finished CTI-funded project to our track record, which dealt in such a business domain. Dubbed “Stiftungsregister 2.0”, the aim of the project was to create an application which enables users to search for all foundations in existence in Switzerland.
The information engineering group at InIT has recently successfully concluded a CTI funded project. The project is called “expert-match” and has been conducted in cooperation with expert group ag, a professional recruitment and consulting business head-quartered in Zurich, Switzerland.
Expert group’s recruiting focus consists of staffing high-profile expert positions (e.g. specialized senior software engineers, senior management, IT architects). Usually, there are at most a handful of people qualified for a recruiting mandate within the relevant geographical region they operate in. The problem is further compounded by the fact that qualifications are rarely obvious, requiring insight into candidates’ former positions and overall skill sets. Exact matching of job descriptions with candidate profiles, such as in a database environment, is thus unlikely to find the desired experts. By implementing an information retrieval (IR) application, we were able to mitigate the problem. The application fully supports iterative searches, especially focussing on relevance feedback.
In a recent CTI project with our industry partner Nektoon AG we were involved in the development of the context intelligence application Squirro. In Squirro, users can create topics that consist of various text streams such as RSS feeds, blogs and Facebook accounts (see for example the following marketing video from Nektoon):
One particular problem was to design and implement a method to identify text documents in a stream that a user might be interested to read. For example in an RSS feed of a company, a user might only be interested in a specific product of this particular company. Thus, he will generally ignore documents about other topics and would prefer to not seeing these anymore. The chosen approach is to infer the future interests of a user based on his past interactions with documents. From these actions we can determine a set of documents which the user is expected to be interested in and create a profile for each user using state of the art text feature selection methods. This allows us to calculate how well a document matches the usual interest of a user. According to this ranking we sort the documents and thus documents matching the user’s interest profile most closely rise to the top ranks.