A keyquery-based classification system for CORE

  • We apply keyquery-based taxonomy composition to compute a classification system for the CORE dataset, a shared crawl of about 850,000 scientific papers. Keyquery-based taxonomy composition can be understood as a two-phase hierarchical document clustering technique that utilizes search queries as cluster labels: In a first phase, the document collection is indexed by a reference search engine, and the documents are tagged with the search queries they are relevant for, their so-called keyqueries. In a second phase, a hierarchical clustering is formed from the keyqueries within an iterative process. We use the explicit topic model ESA as document retrieval model in order to index the CORE dataset in the reference search engine. Under the ESA retrieval model, documents are represented as vectors of similarities to Wikipedia articles, a methodology proven to be advantageous for text categorization tasks. Our paper presents the generated taxonomy and reports on quantitative properties such as document coverage and processing requirements.
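The two phases described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy concept articles, the toy documents, the candidate queries, the cosine scoring, the top-k cutoff, and the greedy label selection in `greedy_labels` are all assumptions made for illustration, and the sketch produces only a single flat level rather than a full hierarchy.

```python
from collections import Counter
from math import sqrt

# Hypothetical stand-ins for Wikipedia articles (the ESA concept space).
CONCEPTS = {
    "Machine learning": "learning model training neural network data",
    "Information retrieval": "search query index retrieval ranking document",
    "Genetics": "gene dna genome protein sequence",
}

# Hypothetical toy collection; the real input would be the CORE papers.
DOCS = {
    "d1": "neural network training for ranking search results",
    "d2": "query index structures for document retrieval",
    "d3": "dna sequence analysis of the human genome",
    "d4": "training a model on gene and protein data",
}

# Hypothetical candidate queries that may become cluster labels.
QUERIES = ["neural network", "document retrieval", "genome sequence"]


def bag(text):
    """Bag-of-words term counts for a text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def esa_vector(text):
    """ESA-style representation: similarity of the text to each concept article."""
    words = bag(text)
    return {name: cosine(words, bag(article)) for name, article in CONCEPTS.items()}


def search(query, index, k=2, min_score=0.1):
    """Toy reference search engine: up to k best documents above min_score."""
    q = esa_vector(query)
    scored = sorted(((cosine(q, vec), doc_id) for doc_id, vec in index.items()),
                    reverse=True)
    return [doc_id for score, doc_id in scored[:k] if score > min_score]


def greedy_labels(postings):
    """Repeatedly pick the query that covers the most still-unassigned documents."""
    uncovered = set().union(*postings.values())
    clusters = {}
    while uncovered:
        best = max(sorted(postings), key=lambda q: len(postings[q] & uncovered))
        gained = postings[best] & uncovered
        if not gained:
            break
        clusters[best] = sorted(gained)
        uncovered -= gained
    return clusters


# Phase 1: index the collection and tag each document with its keyqueries.
index = {doc_id: esa_vector(text) for doc_id, text in DOCS.items()}
postings = {q: set(search(q, index)) for q in QUERIES}
keyqueries = {d: [q for q in QUERIES if d in postings[q]] for d in DOCS}

# Phase 2: form labeled clusters from the keyqueries (one flat level here).
clusters = greedy_labels(postings)
```

Applying `greedy_labels` again to the documents inside each cluster, with more specific candidate queries, would be one way to obtain deeper levels; the source states only that the hierarchy is formed from the keyqueries in an iterative process, so the exact procedure remains an assumption here.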

Download full text files

  • Full text, English
    (435KB)

    Manuscript version, secondary publication

Metadata
Document Type: Article
Author: Michael Völske, Tim Gollub, Matthias Hagen, Benno Stein
DOI (Cite-Link): https://doi.org/10.1045/november14-voelske
URN (Cite-Link): https://nbn-resolving.org/urn:nbn:de:gbv:wim2-20170426-31662
Parent Title (English): D-Lib Magazine
Language: English
Date of Publication (online): 2017/04/26
Year of first Publication: 2014
Release Date: 2017/04/26
Publishing Institution: Bauhaus-Universität Weimar
Institutes: Bauhaus-Universität Weimar / In cooperation with Bauhaus-Universität Weimar
Tag: Dynamic Taxonomy Composition, Keyquery, Classification Systems, Reverted Index, Big Data Problem
GND Keyword: Big data (Massendaten); Taxonomy (Taxonomie)
Dewey Decimal Classification: 000 Computer science, information science, general works / 000 Computer science, knowledge, systems
BKL-Classification: 54 Computer science / 54.82 Text processing
Licence (German): Secondary publication