|
" Mono- and Cross-Lingual Paraphrased Text Reuse and Extrinsic Plagiarism Detection "
Sharjeel, Muhammad
Rayson, Paul
Document Type
|
:
|
Latin Dissertation
|
Language of Document
|
:
|
English
|
Record Number
|
:
|
1054871
|
Doc. No
|
:
|
TL53988
|
Main Entry
|
:
|
Sharjeel, Muhammad
|
Title & Author
|
:
|
Mono- and Cross-Lingual Paraphrased Text Reuse and Extrinsic Plagiarism Detection / Sharjeel, Muhammad; Rayson, Paul
|
College
|
:
|
Lancaster University (United Kingdom)
|
Date
|
:
|
2020
|
Degree
|
:
|
Ph.D.
|
student score
|
:
|
2020
|
Note
|
:
|
234 p.
|
Abstract
|
:
|
Text reuse is the act of borrowing text (either verbatim or paraphrased) from an earlier written text. It can occur within the same language (mono-lingual) or across languages (cross-lingual), where the reused text is in a different language from the original. Text reuse and its related problem, plagiarism (the unacknowledged reuse of text), are becoming serious issues in many fields, and research shows that paraphrased and especially cross-lingual cases of reuse are much harder to detect. Moreover, the recent rise in readily available multi-lingual content on the Web and social media has increased the problem to an unprecedented scale. Standard evaluation resources are of utmost importance for developing, comparing, and evaluating automatic methods for mono- and cross-lingual text reuse detection and extrinsic plagiarism detection (finding the portions of a text that are reused from an original text). However, previous efforts to develop such resources have mostly focused on English and some other languages. The Urdu language, by contrast, is widely spoken and has a large digital footprint, yet lacks core language processing tools and corpora. With this in mind, this PhD research focuses on developing standard evaluation corpora, methods, and supporting resources to automatically detect mono-lingual (Urdu) and cross-lingual (English-Urdu) cases of text reuse and extrinsic plagiarism. This thesis contributes a mono-lingual (Urdu) text reuse corpus (COUNTER Corpus) that contains real cases of Urdu text reuse at document level. Another contribution is a mono-lingual (Urdu) extrinsic plagiarism corpus (UPPC Corpus) that contains simulated cases of Urdu paraphrase plagiarism. Evaluation results, obtained by applying a wide range of state-of-the-art mono-lingual methods to both corpora, show that it is easier to detect verbatim cases than paraphrased ones. Moreover, the performance of these methods decreases considerably on real cases of reuse. A couple of supporting resources are also created to assist methods used in cross-lingual (English-Urdu) text reuse detection. A large-scale multi-domain English-Urdu parallel corpus (EUPC-20) containing parallel sentences is mined from the Web, and several bi-lingual (English-Urdu) dictionaries are compiled from different sources using multiple approaches.
|
Descriptor
|
:
|
Computer science
|
|
:
|
Linguistics
|
Added Entry
|
:
|
Rayson, Paul
|
Added Entry
|
:
|
Lancaster University (United Kingdom)
|
| |