The goal of LivingKnowledge is to bring a new quality to search and knowledge management technology, yielding more concise, complete and contextualised search results.
The paper “Exploring a corpus of scientific texts using data mining”, co-written by Teich and Fankhauser, has been published in Corpus-linguistic applications: Current studies, new directions, edited by Stefan Th. Gries, Stefanie Wulff and Mark Davies (Amsterdam/New York, NY, 2010).
We report on a project investigating the linguistic properties of English scientific texts on the basis of a corpus of journal articles from nine academic disciplines. The goal of the project is to gain insights on registers emerging at the boundaries of computer science and some other discipline (e.g., bioinformatics, computational linguistics, computational engineering). The questions we focus on in this paper are (a) how characteristic is the corpus of the meta-register it represents, and (b) how different/similar are the subcorpora in terms of the more specific registers they instantiate. We analyze the corpus using several data mining techniques, including feature ranking, clustering and classification, to see how the subcorpora group in terms of selected linguistic features. The results show that our corpus is well distinguished in terms of the meta-register of scientific writing; also, we find interesting distinctive features for the subcorpora as indicators of register diversification. Apart from presenting the results of our analyses, we will also reflect upon and assess the use of data mining for the tasks of corpus exploration and analysis.