
Analyzing and Predicting Quality Flaws in User-generated Content: The Case of Wikipedia

  • Web applications that are based on user-generated content are often criticized for containing low-quality information; a popular example is the online encyclopedia Wikipedia. The major points of criticism pertain to the accuracy, neutrality, and reliability of information. The identification of low-quality information is an important task, since for a huge number of people around the world it has become a habit to first visit Wikipedia in case of an information need. Existing research on quality assessment in Wikipedia either investigates only small samples of articles or deals with the classification of content into high quality or low quality. This thesis goes further: it targets the investigation of quality flaws, thus providing specific indications of the respects in which low-quality content needs improvement. The original contributions of this thesis, which relate to the fields of user-generated content analysis, data mining, and machine learning, can be summarized as follows: (1) We propose the investigation of quality flaws in Wikipedia based on user-defined cleanup tags. Cleanup tags are commonly used in the Wikipedia community to tag content that has some shortcomings. Our approach is based on the hypothesis that each cleanup tag defines a particular quality flaw. (2) We provide the first comprehensive breakdown of Wikipedia's quality flaw structure.
We present a flaw organization schema, and we conduct an extensive exploratory data analysis which reveals (a) the flaws that actually exist, (b) the distribution of flaws in Wikipedia, and (c) the extent of flawed content. (3) We present the first breakdown of Wikipedia's quality flaw evolution. We consider the entire history of the English Wikipedia from 2001 to 2012, which comprises more than 508 million page revisions, summing up to 7.9 TB. Our analysis reveals (a) how the incidence and the extent of flaws have evolved and (b) how the handling and the perception of flaws have changed over time. (4) We are the first to operationalize an algorithmic prediction of quality flaws in Wikipedia. We cast quality flaw prediction as a one-class classification problem, develop a tailored quality flaw model, and employ a dedicated one-class machine learning approach. A comprehensive evaluation based on human-labeled Wikipedia articles underlines the practical applicability of our approach.
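The one-class setting mentioned in contribution (4) means training only on positive (flawed) examples, without a labeled set of flawless articles. A minimal sketch of that idea, assuming toy article feature vectors and a simple centroid-plus-distance-threshold rule (both illustrative assumptions, not the thesis's actual flaw model):

```python
# Sketch of one-class flaw prediction: learn a region around flawed
# (cleanup-tagged) training articles only; a new article is flagged as
# flawed if its feature vector falls inside that region. The features
# and the centroid-distance rule are hypothetical illustrations.
import math

def centroid(vectors):
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train_one_class(flawed_vectors, quantile=0.95):
    """Fit on flawed examples only: centroid plus a distance threshold
    that encloses roughly `quantile` of the training points."""
    c = centroid(flawed_vectors)
    dists = sorted(distance(v, c) for v in flawed_vectors)
    threshold = dists[min(int(quantile * len(dists)), len(dists) - 1)]
    return c, threshold

def predict_flawed(vector, model):
    """Predict 'flawed' if the article lies inside the learned region."""
    c, threshold = model
    return distance(vector, c) <= threshold

# Toy feature vectors, e.g. (reference density, external-link count):
flawed = [(0.1, 2.0), (0.2, 1.0), (0.15, 3.0), (0.05, 2.5)]
model = train_one_class(flawed)
print(predict_flawed((0.12, 2.0), model))  # near the flawed cluster -> True
print(predict_flawed((5.0, 40.0), model))  # far from it -> False
```

A real system would use a dedicated one-class learner (the thesis names a tailored one-class machine learning approach) and richer article features, but the training regime — positives only, no clean counterexamples — is the same.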

Metadata
Document Type: Doctoral Thesis
Author: Maik Anderka
DOI: https://doi.org/10.25643/bauhaus-universitaet.1977
URN: https://nbn-resolving.org/urn:nbn:de:gbv:wim2-20130709-19778
Referee: Prof. Dr. Michael Granitzer
Advisor: Prof. Dr. Benno Stein
Language: English
Date of Publication (online): 2013/07/09
Year of first Publication: 2013
Date of final exam: 2013/06/28
Release Date: 2013/07/09
Publishing Institution: Bauhaus-Universität Weimar
Granting Institution: Bauhaus-Universität Weimar, Fakultät Medien
Institutes: Fakultät Medien / Professur Content Management - Web-Technologien
Tag: Information Quality Assessment; Quality Flaw Prediction; User-generated Content Analysis
GND Keyword: Data Mining; Machine Learning; Wikipedia
Dewey Decimal Classification: 000 Computer science, information, and general works
BKL Classification: 54 Computer science
Licence: Creative Commons 4.0 - Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND 4.0)