In this article, I recount measures and approaches used to deal with a relatively small dataset that, in turn, has to be covered “perfectly”. In current academic research and large-scale industrial applications, datasets contain millions to billions (or even more) of documents. While this burdens implementers with considerations of scale at the level of infrastructure, it may make matching comparatively easy: if users are content with a few high-quality results, good retrieval effectiveness is simple to attain. Larger datasets are more likely to contain any requested information, linguistically encoded in many different ways, i.e., using different spellings, sentences, grammar, languages, etc., so even a “blind shot” will hit one of many targets.
However, there are business domains in which the number of main entities will never reach the billions, which inherently limits the document pool. We have recently added another successfully finished CTI-funded project to our track record, which operated in such a business domain. Dubbed “Stiftungsregister 2.0”, the aim of the project was to create an application which enables users to search for all foundations in existence in Switzerland.
In the following, I describe the problems with a small collection size encountered in the Stiftungsregister project and approaches to deal with them. Keep in mind that “small” is strictly used as a qualifier with regard to search problems typically covered by information retrieval systems. The collection in this system is not small at all considering the domain it covers: it is in fact the largest collection of information on Swiss foundations and contains an unprecedented amount of information.
Continuing with the case at hand, let us look at the main issues:
- Relatively small and mostly stable pool of documents (foundations)
- Comparatively little initial information per document
- Strongly varying text length per document
The number of charitable foundations in Switzerland amounts to about 13’000, not counting further charitable organizations. For the purpose of the project, each document represents a foundation and all of its available information. Foundations are created and dissolved on a regular basis but usually exhibit more stable lifetimes than companies. Even intranet search engines of SMEs can be expected to deal with larger document counts and possibly even with multiple business domains and taxonomies.
A foundation’s registration information contains only a few mandatory points: the name of the foundation, a description of purpose, an address and the names of the foundation council’s members. In contrast to a typical website, for example, a foundation’s description offers less text. While the law requires the purpose of a foundation to be described upon registration, it does not prescribe formal guidelines for either the length or the specificity of such a description. To maintain maximum freedom of operations and guarantee the longevity of the description, the descriptions are mostly generic and similarly worded.
As elaborated on within the previous paragraph, the main textual description a foundation’s registration offers is its purpose. Larger, e.g. company-run, foundations provide an in-depth description. Most foundations, however, are privately created and their descriptions are not burdened with highlighting another organization’s involvement. Some foundations also seek obscurity for unknown reasons. The latter types will usually give a very short and vague description of purpose, about as long as a large tweet.
As a compounding difficulty to the project, information was provided in German, French and Italian, roughly in the same distribution as the languages occur in Switzerland. Additionally, there are some mixed-language descriptions, some of them also containing English. Multilingual information retrieval is not the topic of this article, however, so I defer to other literature or future articles.
The use case of finding funding through foundations requires both high precision and high recall. Precision helps retrieve the foundations that fit a particular project at the top ranks of the search result. Fewer but better-targeted applications help reduce the workload of both funders and entities seeking funding. On the other hand, recall is equally important when only a limited number of foundations are interested in funding a particular type of project at all. Many keywords that would hit millions of easy matches in Web search engines match few or no descriptions in this collection. Any foundation which is topically similar to a project should be retrieved so that it may be considered for an application.
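To make the two measures concrete, here is a minimal sketch of how precision at the top ranks and recall could be computed for a single query. The function names and the foundation identifiers are made up for illustration and are not taken from the project.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)


def recall(ranked_ids, relevant_ids):
    """Fraction of all relevant documents that were retrieved at all."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids) & set(relevant_ids)) / len(relevant_ids)


# Hypothetical result list and relevance judgements for one query.
ranked = ["f3", "f7", "f1", "f9"]
relevant = {"f3", "f1", "f5"}

print(precision_at_k(ranked, relevant, k=2))
print(recall(ranked, relevant))
```

In a collection where only a handful of foundations fit a project, a single missed document (such as `f5` above) already costs a large share of recall.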
One of the best understood and most used approaches to increase recall is relevance feedback. One of our previous articles gave an insight into relevance feedback details.
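As a rough illustration of the idea, the following sketch expands a query with terms taken from documents the user marked as relevant. It is a simplified, positive-only variant of the classic Rocchio formulation, chosen here only as an example; it is not necessarily the method used in the Stiftungsregister. The function name, the parameter values and the sample terms are assumptions.

```python
from collections import Counter


def rocchio_expand(query_terms, relevant_docs, alpha=1.0, beta=0.75, top_n=3):
    """Return the query plus the top_n highest-weighted feedback terms.

    query_terms   -- list of terms in the original query
    relevant_docs -- list of tokenized documents judged relevant
    alpha, beta   -- weights of the original query and the feedback centroid
    """
    weights = Counter()
    for term in query_terms:
        weights[term] += alpha
    for doc in relevant_docs:
        if not doc:
            continue
        # Length-normalized term frequency, averaged over the relevant docs.
        for term, count in Counter(doc).items():
            weights[term] += beta * count / (len(doc) * len(relevant_docs))
    original = set(query_terms)
    # Sort by descending weight; ties are broken alphabetically.
    candidates = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    new_terms = [term for term, _ in candidates if term not in original]
    return list(query_terms) + new_terms[:top_n]


# Hypothetical tokenized purpose descriptions marked as relevant.
feedback = [["jugend", "musik", "kultur"], ["musik", "bildung", "jugend"]]
expanded = rocchio_expand(["jugend"], feedback, top_n=2)
print(expanded)
```

The expanded query can match foundations whose descriptions never mention the original keyword, which is where the recall gain comes from.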
Another approach is continuous augmentation and extension of documents. By increasing text content of documents, more terms can potentially match any query. Great care must be taken to only add truly relevant information to the documents. In the case of the Stiftungsregister, two sources of additional information are utilized. The first is an offline “seeding” of selected foundations. The most prominent foundations’ descriptions are extended by their own published content which exceeds the registration information by far. Additionally, the application supports community annotations which are added to the descriptions once they have been verified by community consensus or the foundations themselves.
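The augmentation step could be sketched as follows: the indexed text of a foundation is assembled from its registration data, any seeded content, and only those annotations that have been verified. The class names, the fields and the vote threshold standing in for “community consensus” are all hypothetical; the actual data model of the application is not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    text: str
    community_votes: int = 0               # hypothetical consensus signal
    confirmed_by_foundation: bool = False


@dataclass
class FoundationDocument:
    name: str
    purpose: str
    seeded_texts: List[str] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)


def is_verified(annotation, min_votes=3):
    """Accept annotations confirmed by the foundation or by enough votes.

    The threshold of 3 votes is an assumption made for this sketch.
    """
    return annotation.confirmed_by_foundation or annotation.community_votes >= min_votes


def indexable_text(doc, min_votes=3):
    """Concatenate registration data, seeded content and verified annotations."""
    parts = [doc.name, doc.purpose, *doc.seeded_texts]
    parts += [a.text for a in doc.annotations if is_verified(a, min_votes)]
    return "\n".join(parts)


doc = FoundationDocument(
    name="Beispielstiftung",
    purpose="Förderung kultureller Projekte",
    seeded_texts=["Unterstützt Musikfestivals in der Region."],
    annotations=[
        Annotation("Fördert auch Jugendorchester.", confirmed_by_foundation=True),
        Annotation("Vergibt Stipendien.", community_votes=5),
        Annotation("Unbestätigter Hinweis.", community_votes=1),
    ],
)
print(indexable_text(doc))
```

Keeping unverified annotations out of the index is what guards precision while the additional text raises recall.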
The issue of varying text lengths is primarily offset by using a length-normalizing weighting scheme, e.g. the state-of-the-art BM25 [Robertson et al., 1995] or derivatives. However, a significant biasing effect remains. Short documents have fewer terms, which are, by virtue of normalization, treated as being much more characteristic than they would be in the context of large documents. Short and generic queries may therefore retrieve a large number of tersely and likewise generically described foundations. For a scientific treatment of this bias in favour of shorter documents and possible remedies, see e.g. [Lv & Zhai, 2011]. Fortunately, this hardly affected the application. Funding-seeking users tend to issue distinctive (although possibly still short) queries pertaining to their projects. The peculiarity of their queries leads to recall being much more of an issue than precision.
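The length effect can be illustrated with a minimal sketch of a BM25 term weight. It uses one common variant of the formula (with a non-negative IDF) and the customary default parameters k1 = 1.2 and b = 0.75. Apart from the collection size of about 13,000, all numbers below are invented for illustration and do not come from the project.

```python
import math


def bm25_term_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document.

    tf          -- occurrences of the term in the document
    doc_len     -- document length in terms
    avg_doc_len -- average document length in the collection
    n_docs      -- number of documents in the collection
    doc_freq    -- number of documents containing the term
    """
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: below 1 for short documents, above 1 for long ones.
    norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


# The same term occurs once in a terse and once in an in-depth description.
common = dict(tf=1, avg_doc_len=150, n_docs=13000, doc_freq=50)
short_weight = bm25_term_weight(doc_len=30, **common)
long_weight = bm25_term_weight(doc_len=600, **common)

print(short_weight > long_weight)  # True: the short document scores higher
```

With b = 0, length normalization is switched off and both documents receive the same weight, which shows that the difference stems from the normalization alone.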
Addendum: Stiftungsregister 2.0 Context
Foundations must be registered in Switzerland, but the federated nature of registrations and their predominant publication in print make a complete overview of foundations (and companies) rather difficult. Until the launch of www.stiftungschweiz.ch, there was no central service that offered a complete list of foundations within the country’s borders along with search functionality. While there is a federal online registry, it only contains foundations governed by the Swiss federal administration (approx. 4000 entries), as opposed to the multitude of foundations governed on communal or cantonal levels. The available data is therefore inherently fragmented between administrations on three hierarchical levels.
The Stiftungsregister’s first great accomplishment is providing access to foundations on all administrative levels. Offering useful search and browsing functionality on this comprehensive collection of information is the second achievement of the project.
|[Robertson et al., 1995]||Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., & Gatford, M. (1995). Okapi at TREC-3. NIST Special Publication SP, 109.|
|[Lv & Zhai, 2011]||Lv, Y., & Zhai, C. (2011, July). When documents are very long, BM25 fails! In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1103-1104). ACM.|