The goal of LivingKnowledge is to bring a new quality into search and knowledge management technology, yielding more concise, complete and contextualised search results.
The paper “Scalable Knowledge Harvesting with High Precision and High Recall”, co-written by N. Nakashole, M. Theobald and G. Weikum, was presented at ACM WSDM 2011 in Hong Kong, held February 9-12, 2011, and has been published in the ACM Digital Library.
Harvesting relational facts from Web sources has received great attention for automatically constructing large knowledge bases. State-of-the-art approaches combine pattern-based gathering of fact candidates with constraint-based reasoning. However, they still face major challenges regarding the trade-offs between precision, recall, and scalability. Techniques that scale well are susceptible to noisy patterns that degrade precision, while techniques that employ deep reasoning for high precision cannot cope with Web-scale data.
This paper presents a scalable system, called PROSPERA, for high-quality knowledge harvesting. We propose a new notion of ngram-itemsets for richer patterns, and use MaxSat-based constraint reasoning on both the quality of patterns and the validity of fact candidates. We compute pattern-occurrence statistics for two benefits: they serve to prune the hypotheses space and to derive informative weights of clauses for the reasoner. The paper shows how to incorporate these building blocks into a scalable architecture that can parallelize all phases on a Hadoop-based distributed platform. Our experiments with the ClueWeb09 corpus include comparisons to the recent ReadTheWeb experiment. We substantially outperform these prior results in terms of recall, with the same precision, while having low run-times.
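To make the idea of n-gram itemsets and pattern-occurrence statistics more concrete, the following is a minimal sketch and not PROSPERA's actual implementation. It assumes that a pattern is the phrase found between two entities, that an itemset is the set of word n-grams in that phrase, that itemsets are compared by Jaccard overlap, and that a pattern's weight is the fraction of its occurrences that match known seed facts. The function names `ngram_itemset`, `jaccard` and `pattern_confidence` are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Tuple

NGram = Tuple[str, ...]


def ngram_itemset(phrase: str, n: int = 2) -> FrozenSet[NGram]:
    """Return the set of word n-grams in a phrase (illustrative assumption).

    Treating the n-grams as a set lets two surface phrases match
    partially instead of requiring an exact string match.
    """
    tokens = phrase.lower().split()
    if not tokens:
        return frozenset()
    if len(tokens) < n:
        # Phrase shorter than n: keep it as a single, shorter item.
        return frozenset({tuple(tokens)})
    return frozenset(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def jaccard(a: FrozenSet[NGram], b: FrozenSet[NGram]) -> float:
    """Jaccard overlap of two itemsets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pattern_confidence(seed_hits: int, counter_hits: int) -> float:
    """Share of a pattern's occurrences that agree with known seed facts.

    A hypothetical statistic: such a value could be used both to discard
    weak patterns and as a clause weight for a weighted MaxSat reasoner.
    """
    total = seed_hits + counter_hits
    return seed_hits / total if total else 0.0
```

Under these assumptions, the phrases "was born in the city of" and "was born in" share two of five distinct bigrams, so `jaccard` returns 0.4, and a pattern seen three times with seed facts and once with a counter-example gets a confidence of 0.75.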