TY  - THES
A1  - Ajjour, Yamen
T1  - Addressing Controversial Topics in Search Engines
N2  - Search engines are very good at answering queries that look for facts. Still, information needs that concern forming opinions on a controversial topic or making a decision remain a challenge for search engines. Since they are optimized to retrieve satisfying answers, search engines might emphasize a specific stance on a controversial topic in their ranking, amplifying bias in society in an undesired way. Argument retrieval systems support users in forming opinions about controversial topics by retrieving arguments for a given query. In this thesis, we address challenges in argument retrieval systems that concern integrating them in search engines, developing generalizable argument mining approaches, and enabling frame-guided delivery of arguments.

Adapting argument retrieval systems to search engines should start by identifying and analyzing information needs that look for arguments. To identify questions that look for arguments, we develop a two-step annotation scheme that first identifies whether the context of a question is controversial and, if so, assigns it one of three question types: factual, method, or argumentative. Using this annotation scheme, we create a question dataset from the logs of a major search engine and use it to analyze the characteristics of argumentative questions. The analysis shows that the proportion of argumentative questions on controversial topics is substantial and that they mainly ask for reasons and predictions. The dataset is further used to develop a classifier that uniquely maps questions to the question types, reaching a convincing F1-score of 0.78.

While the web offers an invaluable source of argumentative content to respond to argumentative questions, it is characterized by multiple genres (e.g., news articles and social fora). Exploiting the web as a source of arguments relies on developing argument mining approaches that generalize over genre. To this end, we approach the problem of how to extract argument units in a genre-robust way. Our experiments on argument unit segmentation show that transfer across genres is rather hard to achieve using existing sequence-to-sequence models.

Another property of text over which argument mining approaches should generalize is topic. Since new topics on which argument mining approaches have not been trained appear daily, these approaches should be developed in a topic-generalizable way. Towards this goal, we analyze the coverage of 31 argument corpora across topics using three topic ontologies. The analysis shows that the topics covered by existing argument corpora are biased toward a small subset of easily accessible controversial topics, hinting at the inability of existing approaches to generalize across topics. In addition to corpus construction standards, fostering topic generalizability requires a careful formulation of argument mining tasks. Same-side stance classification is a reformulation of stance classification that makes it less dependent on the topic. First experiments on this task show promising results in generalizing across topics.

To be effective at persuading their audience, users of an argument retrieval system should select arguments from the retrieved results based on which frame of a controversial topic they emphasize. An open challenge is to develop an approach that identifies the frames of an argument. To this end, we define a frame as a subset of arguments that share an aspect. We operationalize this model via an approach that identifies and removes the topic of arguments before clustering them into frames. We evaluate the approach on a dataset that covers 12,326 frames and show that identifying the topic of an argument and removing it helps to identify its frames.
KW  - Informatik
KW  - Suchmaschine
KW  - Argumentation
KW  - Internet
KW  - argumentation
KW  - controversial topics
KW  - natural language processing
KW  - search engines
Y1  - 2023
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20230626-64037
ER  - 
TY  - THES
A1  - Anderka, Maik
T1  - Analyzing and Predicting Quality Flaws in User-generated Content: The Case of Wikipedia
N2  - Web applications that are based on user-generated content are often criticized for containing low-quality information; a popular example is the online encyclopedia Wikipedia. The major points of criticism pertain to the accuracy, neutrality, and reliability of information. The identification of low-quality information is an important task since, for a huge number of people around the world, it has become a habit to visit Wikipedia first in case of an information need. Existing research on quality assessment in Wikipedia either investigates only small samples of articles or deals with the classification of content into high-quality or low-quality. This thesis goes further: it targets the investigation of quality flaws, thus providing specific indications of the respects in which low-quality content needs improvement. The original contributions of this thesis, which relate to the fields of user-generated content analysis, data mining, and machine learning, can be summarized as follows:

(1) We propose the investigation of quality flaws in Wikipedia based on user-defined cleanup tags. Cleanup tags are commonly used in the Wikipedia community to tag content that has some shortcomings. Our approach is based on the hypothesis that each cleanup tag defines a particular quality flaw.

(2) We provide the first comprehensive breakdown of Wikipedia's quality flaw structure. We present a flaw organization schema, and we conduct an extensive exploratory data analysis which reveals (a) the flaws that actually exist, (b) the distribution of flaws in Wikipedia, and (c) the extent of flawed content.

(3) We present the first breakdown of Wikipedia's quality flaw evolution. We consider the entire history of the English Wikipedia from 2001 to 2012, which comprises more than 508 million page revisions, summing up to 7.9 TB. Our analysis reveals (a) how the incidence and the extent of flaws have evolved, and (b) how the handling and the perception of flaws have changed over time.

(4) We are the first to operationalize an algorithmic prediction of quality flaws in Wikipedia. We cast quality flaw prediction as a one-class classification problem, develop a tailored quality flaw model, and employ a dedicated one-class machine learning approach. A comprehensive evaluation based on human-labeled Wikipedia articles underlines the practical applicability of our approach.
KW  - Data Mining
KW  - Machine Learning
KW  - Wikipedia
KW  - User-generated Content Analysis
KW  - Information Quality Assessment
KW  - Quality Flaw Prediction
Y1  - 2013
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20130709-19778
ER  - 
TY  - THES
A1  - Al Khatib, Khalid
T1  - Computational Analysis of Argumentation Strategies
N2  - The computational analysis of argumentation strategies is important for many downstream applications. It is required for nearly all kinds of text synthesis, writing assistance, and dialogue-management tools. While various tasks have been tackled in the area of computational argumentation, such as argumentation mining and quality assessment, the computational analysis of argumentation strategies in texts has so far been overlooked.

This thesis principally approaches the analysis of the strategies manifested in persuasive argumentative discourse, which aims for persuasion, and in deliberative argumentative discourse, which aims for consensus. To this end, the thesis presents a novel view of argumentation strategies for these two goals. Based on this view, new models for pragmatic and stylistic argument attributes are proposed, new methods for identifying the modelled attributes are developed, and a new set of strategy principles in texts according to the identified attributes is presented and explored.

Overall, the thesis contributes to the theory, data, method, and evaluation aspects of the analysis of argumentation strategies. The models, methods, and principles developed and explored in this thesis can be regarded as essential for promoting the applications mentioned above, among others.
KW  - Argumentation
KW  - Natürliche Sprache
KW  - Argumentation Strategies
KW  - Sprachverarbeitung
KW  - Natural Language Processing
KW  - Computational Argumentation
Y1  - 2021
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20210719-44612
ER  - 
TY  - THES
A1  - Bunte, Andreas
T1  - Entwicklung einer ontologiebasierten Beschreibung zur Erhöhung des Automatisierungsgrades in der Produktion
N2  - The observed shortening of product life cycles and the faster market penetration of product technologies call for adaptive and high-performing production plants. Adaptivity allows a production plant to be adjusted to new products, and the plant's performance ensures that sufficient products can be manufactured in a short time and at low cost. Adaptivity can be achieved by modularizing the production plant. Today, however, every adaptation requires manual effort, e.g., to adapt proprietary signals or higher-level functions. This reduces the plant's performance.

The goal of this thesis is to ensure interoperability with respect to the use of information in modular production plants. To this end, information is described by semantic models. This enables uniform access to information, and higher-level functions gain access to all information of the production modules, regardless of a module's type, manufacturer, and age. This eliminates the manual effort of adapting the modular production system, which increases the plant's performance and reduces downtime.

After the requirements for a modeling formalism were determined, candidate formalisms were compared against them. OWL DL proved to be a suitable formalism and was used to create the semantic model in this thesis. As an example, a semantic model was created for the three use cases interaction, orchestration, and diagnosis. The generality of the model was assessed by comparing the modeling elements of the different use cases. This showed that a general model for technical use cases is achievable and requires only a few hundred terms.

To evaluate the created models, a versatile production system of the SmartFactoryOWL was used, on which the use cases were implemented. For this purpose, a runtime environment was created that merges the semantic models of the individual modules into an overall model, transfers data from the plant into the model, and provides an interface for the services. The services realize higher-level functions and use the information of the semantic model. In all three use cases, the semantic models were merged correctly, and the information they contain was sufficient to solve the task of the respective use case without additional manual effort.
KW  - Ontologie
KW  - Metamodell
KW  - Interoperabilität
KW  - OWL <Informatik>
KW  - Industrie 4.0
Y1  - 2020
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20201215-43156
ER  - 
TY  - THES
A1  - Kiesel, Johannes
T1  - Harnessing Web Archives to Tackle Selected Societal Challenges
N2  - With the growing importance of the World Wide Web, the major challenges our society faces are also increasingly affecting the digital areas of our lives. Some of the associated problems can be addressed by computer science, and some of these specifically by data-driven research. Doing so, however, requires solving open issues related to archive quality and to the large volume and variety of the data contained.

This dissertation contributes data, algorithms, and concepts towards leveraging the big data and temporal provenance capabilities of web archives to tackle societal challenges. We selected three such challenges that highlight the central issues of archive quality, data volume, and data variety, respectively:
(1) For the preservation of digital culture, this thesis investigates and improves the automatic quality assurance of the web page archiving process, as well as the further processing of the resulting archive data for automatic analysis.
(2) For the critical assessment of information, this thesis examines large datasets of Wikipedia and news articles and presents new methods for automatically determining quality and bias.
(3) For digital security and privacy, this thesis exploits the variety of content on the web to quantify the security of mnemonic passwords and analyzes the privacy-aware re-finding of the various content a user has seen through private web archives.
N2  - Mit der wachsenden Bedeutung des World Wide Webs betreffen die großen Herausforderungen unserer Gesellschaft zunehmend auch die digitalen Bereiche unseres Lebens. Einige der zugehörigen Probleme können durch die Informatik, und einige von diesen speziell durch datengetriebene Forschung, angegangen werden. Dazu müssen jedoch offene Fragen im Zusammenhang mit der Qualität der Archive und der großen Menge und Vielfalt der enthaltenen Daten gelöst werden. 

Diese Dissertation trägt mit Daten, Algorithmen und Konzepten dazu bei, die große Datenmenge und temporale Protokollierung von Web-Archiven zu nutzen, um gesellschaftliche Herausforderungen zu bewältigen. Wir haben drei solcher Herausforderungen ausgewählt, die die zentralen Probleme der Archivqualität, des Datenvolumens und der Datenvielfalt hervorheben:
(1) Für die Bewahrung der digitalen Kultur untersucht und verbessert diese Arbeit die automatische Qualitätsbestimmung einer Webseiten-Archivierung, sowie die weitere Aufbereitung der dabei entstehenden Archivdaten für automatische Auswertungen.
(2) Für die kritische Bewertung von Information untersucht diese Arbeit große Datensätze an Wikipedia- und Nachrichtenartikeln und stellt neue Verfahren zur Bestimmung der Qualität und Einseitigkeit/Parteilichkeit vor.
(3) Für die digitale Sicherheit und den Datenschutz nutzt diese Arbeit die Vielfalt der Inhalte im Internet, um die Sicherheit von mnemonischen Passwörtern zu quantifizieren, und analysiert das datenschutzbewusste Wiederauffinden der verschiedenen gesehenen Inhalte mit Hilfe von privaten Web-Archiven.
KW  - Informatik
KW  - Internet
KW  - Web archive
Y1  - 2022
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20220622-46602
ER  - 
TY  - THES
A1  - Gollub, Tim
T1  - Information Retrieval for the Digital Humanities
N2  - In ten chapters, this thesis presents information retrieval technology which is tailored to the research activities that arise in the context of corpus-based digital humanities projects. 

The presentation is structured by a conceptual research process that is introduced in Chapter 1. The process distinguishes a set of five research activities: research question generation, corpus acquisition, research question modeling, corpus annotation, and result dissemination. Each of these research activities elicits different information retrieval tasks with special challenges, for which algorithmic approaches are presented after an introduction of the core information retrieval concepts in Chapter 2. 

A vital concept in many of the presented approaches is the keyquery paradigm introduced in Chapter 3, which represents an operation that returns relevant search queries in response to a given set of input documents. Keyqueries are proposed in Chapter 4 for the recommendation of related work, and in Chapter 5 for improving access to aspects hidden in the long tail of search result lists. 

With pseudo-descriptions, a document expansion approach is presented in Chapter 6. The approach improves the retrieval performance for corpora where only bibliographic meta-data is originally available. In Chapter 7, the keyquery paradigm is employed to generate dynamic taxonomies for corpora in an unsupervised fashion.

Chapter 8 turns to the exploration of annotated corpora, and presents scoped facets as a conceptual extension to faceted search systems, which is particularly useful in exploratory search settings. For the purpose of highlighting the major topical differences in a sequence of sub-corpora, an algorithm called topical sequence profiling is presented in Chapter 9.

The thesis concludes with two pilot studies regarding the visualization of (re)search results as a means of successful result dissemination: a metaphoric interpretation of the information nutrition label, as well as the philosophical bodies, which are 3D-printed search results.
N2  - In zehn Kapiteln stellt diese Arbeit Information-Retrieval-Technologien vor, die auf die Forschungsaktivitäten korpusbasierter Digital-Humanities-Projekte zugeschnitten sind. 

Die Arbeit strukturiert sich anhand eines konzeptionellen Forschungsprozesses, der in Kapitel 1 vorgestellt wird. Der Prozess gliedert sich in fünf Forschungsaktivitäten: die Generierung einer Forschungsfrage, die Korpusakquise, die Modellierung der Forschungsfrage, die Annotation des Korpus sowie die Verbreitung der Ergebnisse. Jede dieser Forschungsaktivitäten bringt unterschiedliche Information-Retrieval-Aufgaben mit besonderen Herausforderungen mit sich, für die, nach einer Einführung in die zentralen Information-Retrieval-Konzepte in Kapitel 2, algorithmische Ansätze vorgestellt werden. 

Ein wesentliches Konzept der vorgestellten Ansätze ist das in Kapitel 3 eingeführte Keyquery-Paradigma. Hinter dem Paradigma steht eine Suchoperation, die als Antwort auf eine gegebene Menge von Eingabedokumenten relevante Suchanfragen zurückgibt. Keyqueries werden in Kapitel 4 für die Empfehlung verwandter Arbeiten, in Kapitel 5 für die Verbesserung des Zugangs zu Aspekten im Long Tail von Suchergebnislisten vorgeschlagen. 

Mit Pseudo-Beschreibungen wird in Kapitel 6 ein Ansatz zur Document-Expansion vorgestellt. Der Ansatz verbessert die Suchleistung für Korpora, bei denen ursprünglich nur bibliografische Metadaten vorhanden sind. In Kapitel 7 wird das Keyquery-Paradigma eingesetzt, um auf unüberwachte Weise dynamische Taxonomien für Korpora zu generieren.

Kapitel 8 wendet sich der Exploration von annotierten Korpora zu und stellt Scoped Facets als konzeptionelle Erweiterung von facettierten Suchsystemen vor, die besonders in explorativen Suchszenarien nützlich ist. Um die wichtigsten thematischen Unterschiede und Entwicklungen in einer Sequenz von Sub-Korpora hervorzuheben, wird in Kapitel 9 ein Algorithmus zum Topical Sequence Profiling vorgestellt. 

Die Arbeit schließt mit zwei Pilotstudien zur Visualisierung von Such- bzw. Forschungsergebnissen als Mittel für eine erfolgreiche Ergebnisverbreitung: eine metaphorische Interpretation des Information-Nutrition-Labels, sowie die philosophischen Körper, 3D-gedruckte Suchergebnisse.
KW  - Information Retrieval
KW  - Explorative Suche
KW  - Digital Humanities
KW  - keyqueries
Y1  - 2022
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20220801-46738
ER  - 
TY  - THES
A1  - Lipka, Nedim
T1  - Modeling Non-Standard Text Classification Tasks
N2  - Text classification deals with discovering knowledge in texts and is used for extracting, filtering, or retrieving information in streams and collections. The discovery of knowledge is operationalized by modeling text classification tasks, which is mainly a human-driven engineering process. The outcome of this process, a text classification model, is used to inductively learn a text classification solution from a priori classified examples. The building blocks of modeling text classification tasks cover four aspects: (1) the way examples are represented, (2) the way examples are selected, (3) the way classifiers learn from examples, and (4) the way models are selected.

This thesis proposes methods that improve the prediction quality of text classification solutions for unseen examples, especially for non-standard tasks where standard models do not fit. The original contributions are related to the aforementioned building blocks: (1) Several topic-orthogonal text representations are studied in the context of non-standard tasks and a new representation, namely co-stems, is introduced. (2) A new active learning strategy that goes beyond standard sampling is examined. (3) A new one-class ensemble for improving the effectiveness of one-class classification is proposed. (4) A new model selection framework to cope with subclass distribution shifts that occur in dynamic environments is introduced.
KW  - Text Classification
KW  - Machine Learning
Y1  - 2013
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20130307-18626
ER  - 
TY  - THES
A1  - Völske, Michael
T1  - Retrieval Enhancements for Task-Based Web Search
N2  - The task-based view of web search implies that retrieval should take the user perspective into account. Going beyond merely retrieving the most relevant result set for the current query, the retrieval system should aim to surface results that are actually useful to the task that motivated the query.

This dissertation explores how retrieval systems can better understand and support their users’ tasks from three main angles: First, we study and quantify search engine user behavior during complex writing tasks, and how task success and behavior are associated in such settings. Second, we investigate search engine queries formulated as questions, and explore patterns in a large query log that may help search engines to better support this increasingly prevalent interaction pattern. Third, we propose a novel approach to reranking the search result lists produced by web search engines, taking into account retrieval axioms that formally specify properties of a good ranking.
N2  - Die Task-basierte Sicht auf Websuche impliziert, dass die Benutzerperspektive berücksichtigt werden sollte. Über das bloße Abrufen der relevantesten Ergebnismenge für die aktuelle Anfrage hinaus, sollten Suchmaschinen Ergebnisse liefern, die tatsächlich für die Aufgabe (Task) nützlich sind, die diese Anfrage motiviert hat.

Diese Dissertation untersucht, wie Retrieval-Systeme die Aufgaben ihrer Benutzer besser verstehen und unterstützen können, und leistet Forschungsbeiträge unter drei Hauptaspekten: Erstens untersuchen und quantifizieren wir das Verhalten von Suchmaschinenbenutzern während komplexer Schreibaufgaben, und wie Aufgabenerfolg und Verhalten in solchen Situationen zusammenhängen. Zweitens untersuchen wir Suchmaschinenanfragen, die als Fragen formuliert sind, und untersuchen ein Suchmaschinenlog mit fast einer Milliarde solcher Anfragen auf Muster, die Suchmaschinen dabei helfen können, diesen zunehmend verbreiteten Anfragentyp besser zu unterstützen. Drittens schlagen wir einen neuen Ansatz vor, um die von Web-Suchmaschinen erstellten Suchergebnislisten neu zu sortieren, wobei Retrieval-Axiome berücksichtigt werden, die die Eigenschaften eines guten Rankings formal beschreiben.
KW  - Information Retrieval
Y1  - 2019
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20190709-39422
ER  - 
TY  - THES
A1  - Potthast, Martin
T1  - Technologies for Reusing Text from the Web
N2  - Texts from the web can be reused individually or in large quantities. The former is called text reuse and the latter language reuse. We first present a comprehensive overview of the different ways in which text and language are reused today, and how exactly information retrieval technologies can be applied in this respect. The remainder of the thesis then deals with specific retrieval tasks. In general, our contributions consist of models and algorithms, their evaluation, and, for that purpose, large-scale corpus construction.

The thesis is divided into two parts. The first part introduces technologies for text reuse detection, and our contributions are as follows: (1) A unified view of projection-based and embedding-based fingerprinting for near-duplicate detection and the first evaluation of fingerprint algorithms on Wikipedia revision histories as a new, large-scale corpus of near-duplicates. (2) A new retrieval model for the quantification of cross-language text similarity, which gets by without parallel corpora. We have evaluated the model in comparison to other models on many different pairs of languages. (3) An evaluation framework for text reuse and particularly plagiarism detectors, which consists of tailored detection performance measures and a large-scale corpus of automatically generated and manually written plagiarism cases. The latter have been obtained via crowdsourcing. This framework has been successfully applied to evaluate many different state-of-the-art plagiarism detection approaches within three international evaluation competitions.

The second part introduces technologies that solve three retrieval tasks based on language reuse, and our contributions are as follows: (4) A new model for the comparison of textual and non-textual web items across media, which exploits web comments as a source of information about the topic of an item. In this connection, we identify web comments as a largely neglected information source and introduce the rationale of comment retrieval. (5) Two new algorithms for query segmentation, which exploit web n-grams and Wikipedia as a means of discerning the user intent of a keyword query. Moreover, we crowdsource a new corpus for the evaluation of query segmentation which surpasses existing corpora by two orders of magnitude. (6) A new writing assistance tool called Netspeak, which is a search engine for commonly used language. Netspeak indexes the web in the form of web n-grams as a source of writing examples and implements a wildcard query processor on top of it.
N2  - Texte aus dem Web können einzeln oder in großen Mengen wiederverwendet werden. Ersteres wird Textwiederverwendung und letzteres Sprachwiederverwendung genannt. Zunächst geben wir einen ausführlichen Überblick darüber, auf welche Weise Text und Sprache heutzutage wiederverwendet und wie Technologien des Information Retrieval in diesem Zusammenhang angewendet werden können. In der übrigen Arbeit werden dann spezifische Retrievalaufgaben behandelt. Unsere Beiträge bestehen dabei aus Modellen und Algorithmen, ihrer empirischen Auswertung und der Konstruktion von großen Korpora hierfür.

Die Dissertation ist in zwei Teile gegliedert. Im ersten Teil präsentieren wir Technologien zur Erkennung von Textwiederverwendungen und leisten folgende Beiträge: (1) Ein Überblick über projektionsbasierte und einbettungsbasierte Fingerprinting-Verfahren für die Erkennung nahezu identischer Texte, sowie die erstmalige Evaluierung einer Reihe solcher Verfahren auf den Revisionshistorien der Wikipedia. (2) Ein neues Modell zum sprachübergreifenden, inhaltlichen Vergleich von Texten. Das Modell basiert auf einem mehrsprachigen Korpus bestehend aus Pärchen themenverwandter Texte, wie zum Beispiel der Wikipedia. Wir vergleichen das Modell in mehreren Sprachen mit herkömmlichen Modellen. (3) Eine Evaluierungsumgebung für Algorithmen zur Plagiaterkennung. Die Umgebung besteht aus Maßen, die die Güte der Erkennung eines Algorithmus quantifizieren, und einem großen Korpus von Plagiaten. Die Plagiate wurden automatisch generiert sowie mit Hilfe von Crowdsourcing manuell erstellt. Darüber hinaus haben wir zwei Workshops veranstaltet, in denen unsere Evaluierungsumgebung erfolgreich zur Evaluierung aktueller Plagiaterkennungsalgorithmen eingesetzt wurde.

Im zweiten Teil präsentieren wir auf Sprachwiederverwendung basierende Technologien für drei verschiedene Retrievalaufgaben und leisten folgende Beiträge: (4) Ein neues Modell zum medienübergreifenden, inhaltlichen Vergleich von Objekten aus dem Web. Das Modell basiert auf der Auswertung der zu einem Objekt vorliegenden Kommentare. In diesem Zusammenhang identifizieren wir Webkommentare als eine in der Forschung bislang vernachlässigte Informationsquelle und stellen die Grundlagen des Kommentarretrievals vor. (5) Zwei neue Algorithmen zur Segmentierung von Websuchanfragen. Die Algorithmen nutzen Web n-Gramme sowie Wikipedia, um die Intention des Suchenden in einer Suchanfrage festzustellen. Darüber hinaus haben wir mittels Crowdsourcing ein neues Evaluierungskorpus erstellt, das zwei Größenordnungen größer ist als bisherige Korpora. (6) Eine neuartige Suchmaschine, genannt Netspeak, die die Suche nach gebräuchlicher Sprache ermöglicht. Netspeak indiziert das Web als Quelle für gebräuchliche Sprache in der Form von n-Grammen und implementiert eine Wildcardsuche darauf.
KW  - Information Retrieval
KW  - plagiarism detection
KW  - evaluation
KW  - writing assistance
KW  - text reuse
KW  - corpus construction
Y1  - 2012
U6  - http://nbn-resolving.de/urn/resolver.pl?urn:nbn:de:gbv:wim2-20120217-15663
UR  - http://www.webis.de/publications/papers/potthast_2011b.pdf
ER  -