Using verb-noun collocation for disambiguating verb polysemy in English-Arabic statistical machine translation / Hussein Khaled Hussein Soori

This thesis attempts to resolve the problem of verb-noun collocation in English-Arabic Machine Translation engines. This problem can be seen in the semantically ill-formed output produced by current machine translation systems when the wrong verb synonym is chosen for the Arabic translation. Initiall...

Full description

Bibliographic details
Main author: Hussein Soori, Hussein Khaled
Format: Thesis
Published: 2015
Subjects:
author Hussein Soori, Hussein Khaled
description This thesis addresses the problem of verb-noun collocation in English-Arabic machine translation engines. The problem shows up as semantically ill-formed output when a machine translation system chooses the wrong verb synonym for the Arabic translation. It starts when an engine selects from a set of polysemous English verbs to find the verb with the equivalent meaning in Arabic. This selection mostly depends on the syntactic environment and on verb semantic features that serve as selectional restrictions. Selectional restrictions can be very effective at resolving verb polysemy, but they reach a dead end when the task is to find the verb that collocates most strongly with the noun in the Arabic output. To resolve this problem, this work uses a statistical method inspired by Church et al. (1991) in a prototype designed to retrieve verb-noun collocates in Arabic. The test data sets for the prototype were drawn from various topics. Two multi-domain corpora of Modern Standard Arabic were used: the Contemporary Corpus of Arabic and the Arabic Corpus by Mourad Abbas. Together the corpora contain 14 million words. The test data sets were translated by Google, Bing and the prototype designed for this thesis. To evaluate the three engines, a simple metric was proposed that includes a gold-standard value for the verb-noun collocation in the Arabic translation. According to this metric, Bing scored a verb-noun collocation value of 0.72, Google 0.75 and the prototype 0.89. The final results showed an average performance rate between 0.65 and 0.67 for Bing, between 0.63 and 0.85 for Google, and between 0.82 and 0.88 for the prototype.
This thesis shows that retrieving the verb that collocates most strongly with a noun in Arabic corpora is a complex task because Arabic is highly inflectional and agglutinative: particles, personal pronouns (both subject and object) and possessive pronouns are attached to the verb in Arabic texts. The task involves two aspects: choosing the search query and choosing the distance between the noun and the verb. The query for the noun and the verb is largely governed by verb conjugation and noun declension. The search query (stem or lemma) must therefore be modified according to verb features such as tense, number, mood and aspect, and noun features such as number, gender, definiteness, case and possessive clitic. Furthermore, decreasing the search distance may cause the search to miss some genuine collocations, while increasing it can introduce noise. Keywords: English-Arabic machine translation; verb-noun collocation in Arabic; statistical machine translation; collocation retrieval; polysemy and collocation; Arabic corpora
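The trade-off between search distance and noise described in the abstract can be illustrated with a minimal sketch. This record does not say which statistic or preprocessing the prototype uses, so the sketch assumes pointwise mutual information (PMI) over a symmetric token window as a stand-in association measure. The function name `collocation_scores`, the `window` parameter and the English toy tokens (standing in for lemmatised Arabic text) are all hypothetical.

```python
import math
from collections import Counter


def collocation_scores(tokens, noun, candidate_verbs, window=3):
    """Score candidate verbs by PMI with `noun` inside +/- `window` tokens.

    Assumption: PMI is used here only as an illustrative association
    measure; it is not confirmed to be the thesis's actual statistic.
    """
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    # Count each candidate verb that falls inside the window around the noun.
    for i, tok in enumerate(tokens):
        if tok != noun:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in candidate_verbs:
                pair[tokens[j]] += 1
    scores = {}
    for verb in candidate_verbs:
        if pair[verb] == 0:
            # Never seen together within the window: no evidence of collocation.
            scores[verb] = float("-inf")
            continue
        p_joint = pair[verb] / n
        p_verb = unigram[verb] / n
        p_noun = unigram[noun] / n
        scores[verb] = math.log2(p_joint / (p_verb * p_noun))
    return scores


# Hypothetical toy corpus; real input would be a tokenised, lemmatised corpus.
toy = ("he took a decision . she made a cake . "
       "they took a decision today . he made a mistake").split()
narrow = collocation_scores(toy, "decision", {"took", "made"}, window=2)
wide = collocation_scores(toy, "decision", {"took", "made"}, window=3)
best_verb = max(wide, key=wide.get)
```

With `window=2` only "took" is found next to "decision". With `window=3` the unrelated "made" from the following sentence also receives a score, which is the kind of noise a larger distance can introduce. In both cases "took" ranks highest.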
format Thesis
id oai:studentsrepo.um.edu.my:6018
institution Universiti Malaya
publishDate 2015
record_format eprints
spelling oai:studentsrepo.um.edu.my:6018 2015-11-20T08:00:22Z Using verb-noun collocation for disambiguating verb polysemy in English-Arabic statistical machine translation / Hussein Khaled Hussein Soori Hussein Soori, Hussein Khaled P Philology. Linguistics PE English PJ Semitic 2015 Thesis NonPeerReviewed application/pdf http://studentsrepo.um.edu.my/6018/1/PhD_Thesis_Hussein_K_H_Soori.pdf Hussein Soori, Hussein Khaled (2015) Using verb-noun collocation for disambiguating verb polysemy in English-Arabic statistical machine translation / Hussein Khaled Hussein Soori. PhD thesis, University of Malaya. http://studentsrepo.um.edu.my/6018/
title Using verb-noun collocation for disambiguating verb polysemy in English-Arabic statistical machine translation / Hussein Khaled Hussein Soori
topic P Philology. Linguistics
PE English
PJ Semitic
url-record http://studentsrepo.um.edu.my/6018/