An automatic diacritization algorithm for undiacritized Arabic text

Bibliographic Details
Main Author: Zayyan, Ayman Ahmad Muhammad
Format: Thesis
Language: English
Published: 2017
Subjects: T58.5-58.64 Information technology
Online Access: https://etd.uum.edu.my/6822/1/s815357_01.pdf
https://etd.uum.edu.my/6822/2/s815357_02.pdf
https://etd.uum.edu.my/6822/
description Modern Standard Arabic (MSA) is used today in most written and some spoken media. It is, however, not the native dialect of any country. Recently, the volume of written dialectal Arabic text has increased dramatically. Most of this text is written in the Egyptian dialect, which is considered the most widely used dialect and is understood throughout the Middle East. As in other Semitic languages, short vowels in written Arabic are not written as letters but are represented by diacritic marks. Nonetheless, these marks are omitted from most modern Arabic texts (for example, books and newspapers). The absence of diacritic marks creates considerable ambiguity, as an undiacritized word may correspond to more than one correct diacritization (vowelization) form. Hence, the aim of this research is to reduce the ambiguity caused by the absence of diacritic marks, using a hybrid algorithm with significantly higher accuracy than state-of-the-art systems for MSA. The research also implements the algorithm for dialectal Arabic text and evaluates its accuracy there. The design of the proposed algorithm is based on two main techniques: statistical n-grams with maximum likelihood estimation, and a morphological analyzer. The proposition of this research is that merging the word, morpheme, and letter levels, together with their sub-models, into one platform improves automatic diacritization accuracy. Moreover, ignoring case-ending diacritization, that is, the diacritic mark on the last letter of the word, yields a significant reduction in error. The reason for this remarkable improvement is that the Arabic language prohibits adding diacritic marks over some letters. The hybrid algorithm demonstrated good performance: 97.9% when applied to the MSA corpus (Tashkeela), 97.1% when applied to LDC's Arabic Treebank-Part 3 v1.0, and 91.8% when applied to the Egyptian dialectal corpus (CallHome).
The main contribution of this research is the hybrid algorithm for automatic diacritization of undiacritized MSA text and dialectal Arabic text. The proposed algorithm was applied to and evaluated on the Egyptian colloquial dialect, the most widely understood and used dialect throughout the Arab world; based on the literature review, this is considered the first time this has been done.
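The statistical part of the approach described in the abstract (n-grams with maximum likelihood estimation) can be illustrated with a minimal word-level sketch. This is not the thesis's implementation: the greedy left-to-right decoding, the unigram tie-break, the toy two-sentence corpus, and all function names (strip_diacritics, strip_case_ending, train, diacritize) are assumptions made for illustration, and the morpheme level, letter level, and morphological analyzer are not modelled. The only external fact relied on is that the common Arabic diacritics occupy the Unicode range U+064B to U+0652.

```python
import re
from collections import Counter, defaultdict

# Arabic diacritics (tanween, fatha, damma, kasra, shadda, sukun): U+064B..U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")
TRAILING_DIACRITICS = re.compile("[\u064B-\u0652]+$")

def strip_diacritics(word):
    """Return the bare (undiacritized) form of a word."""
    return DIACRITICS.sub("", word)

def strip_case_ending(word):
    """Drop the diacritic(s) on the last letter, e.g. before scoring."""
    return TRAILING_DIACRITICS.sub("", word)

def train(sentences):
    """Count unigrams and bigrams of diacritized words and index the
    diacritized candidates seen for each bare form."""
    unigrams, bigrams = Counter(), Counter()
    candidates = defaultdict(set)
    for sentence in sentences:
        prev = "<s>"  # hypothetical sentence-start token
        unigrams[prev] += 1
        for word in sentence:
            unigrams[word] += 1
            bigrams[(prev, word)] += 1
            candidates[strip_diacritics(word)].add(word)
            prev = word
    return unigrams, bigrams, candidates

def diacritize(bare_words, model):
    """Greedy decoding: pick, for each bare word, the candidate with the
    highest MLE bigram probability given the previously chosen word."""
    unigrams, bigrams, candidates = model
    output, prev = [], "<s>"
    for bare in bare_words:
        # Unknown words are passed through unchanged.
        options = sorted(candidates.get(bare, {bare}))

        def score(word, prev=prev):
            history = unigrams[prev]
            # MLE estimate: P(word | prev) = count(prev, word) / count(prev).
            p = bigrams[(prev, word)] / history if history else 0.0
            return (p, unigrams[word])  # unigram count breaks ties

        best = max(options, key=score)
        output.append(best)
        prev = best
    return output

# Toy data, written with escapes so each mark is explicit.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"    # "he wrote"
KUTUBUN = "\u0643\u064F\u062A\u064F\u0628\u064C"   # "books", with case ending
HADHIHI = "\u0647\u064E\u0630\u0650\u0647\u0650"   # "these"
WALAD = "\u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"  # "the boy"
CORPUS = [[HADHIHI, KUTUBUN], [WALAD, KATABA]]

if __name__ == "__main__":
    model = train(CORPUS)
    bare = strip_diacritics(KATABA)  # same bare form as KUTUBUN
    print(diacritize([strip_diacritics(HADHIHI), bare], model))
    print(diacritize([strip_diacritics(WALAD), bare], model))
```

In this sketch the same bare word receives a different diacritization depending on the preceding word, which is the ambiguity the abstract describes. The strip_case_ending helper shows one way the last-letter mark could be ignored when comparing a predicted word with a reference word.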
format Thesis
id oai:etd.uum.edu.my:6822
institution Universiti Utara Malaysia
spelling oai:etd.uum.edu.my:6822 2021-08-18T08:45:51Z https://etd.uum.edu.my/6822/ Zayyan, Ayman Ahmad Muhammad (2017) An automatic diacritization algorithm for undiacritized Arabic text. Masters thesis, Universiti Utara Malaysia. NonPeerReviewed.
thesis_level Master
title An automatic diacritization algorithm for undiacritized Arabic text
topic T58.5-58.64 Information technology