MALAY-ENGLISH CODE-SWITCHED SOCIAL MEDIA SENTIMENT CORPUS AND SUPERVISED BENCHMARK

Bibliographic Details
Main Author: Afifah Binti, Mohd Shamsuddin
Format: Thesis
Language: English
Published: Universiti Malaysia Sarawak 2025
Subjects:
Online Access: http://ir.unimas.my/id/eprint/49648/
Description
Summary: Code-switching between Malay and English is common on social media platforms such as Twitter (now called X) and YouTube, reflecting Malaysia's linguistically diverse society. However, this phenomenon presents a significant challenge for sentiment analysis, because mixing languages within sentences or phrases increases the likelihood of sentiment misclassification. To address this issue, this research proposed a methodology consisting of a corpus construction framework and a supervised benchmark for sentiment classification on Malay-English code-switched social media data, with the aim of improving the accuracy of sentiment interpretation in such mixed-language content. The first part of the methodology focused on constructing a Malay-English code-switched sentiment corpus from social media data, referred to as MESocSentiment. This corpus consists of 78.23% neutral, 16.33% positive, and 5.44% negative tweets. Data from MESocSentiment were used in a supervised machine learning approach that included a bootstrapping technique for benchmarking sentiment models. The bootstrapping technique was implemented in four rounds, with training data from the corpus added incrementally to increase the dataset size at each round. Subsequently, 6,000 tweets with sentiment labels obtained from bootstrapping were used to train six selected machine learning and deep learning models for benchmarking. The findings showed that SVM was the best model for classifying code-switched data in terms of accuracy and AUC-ROC metrics. It achieved 76.55% accuracy, and all of its AUC values ranged from 0.83 to 0.90. Its AUC values for the micro-average and macro-average ROC curves were 0.90 and 0.84, respectively. Lastly, the MESocSentiment corpus and selected experiment outputs have been published on GitHub for public use.
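
The summary reports both a micro-average and a macro-average AUC for a three-class problem with a heavily skewed label distribution. As a reading aid, the following is a minimal sketch of how the two averages are commonly computed in a one-vs-rest setting. It is not the thesis's code: the function names (`binary_auc`, `macro_micro_auc`) and the toy scores are hypothetical and chosen only for illustration. The macro-average weights every class equally, whereas the micro-average pools all (tweet, class) decisions, so the two can differ when classes are imbalanced.

```python
from typing import Dict, List, Sequence, Tuple


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as 0.5 (the rank-based definition)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_micro_auc(
    y_true: Sequence[str],
    y_score: Sequence[Dict[str, float]],
    classes: Sequence[str],
) -> Tuple[Dict[str, float], float, float]:
    """Return (per-class AUC, macro-average AUC, micro-average AUC)
    using one-vs-rest binarisation of the labels."""
    per_class: Dict[str, float] = {}
    pooled_labels: List[int] = []
    pooled_scores: List[float] = []
    for c in classes:
        labels = [1 if y == c else 0 for y in y_true]
        scores = [row[c] for row in y_score]
        per_class[c] = binary_auc(labels, scores)
        # Micro-averaging treats every (sample, class) pair as one decision.
        pooled_labels.extend(labels)
        pooled_scores.extend(scores)
    macro = sum(per_class.values()) / len(classes)
    micro = binary_auc(pooled_labels, pooled_scores)
    return per_class, macro, micro


# Toy, made-up classifier scores for six tweets (illustration only,
# not data from MESocSentiment).
CLASSES = ["neutral", "positive", "negative"]
toy_true = ["neutral", "neutral", "neutral", "neutral", "positive", "negative"]
toy_score = [
    {"neutral": 0.8, "positive": 0.1, "negative": 0.1},
    {"neutral": 0.7, "positive": 0.2, "negative": 0.1},
    {"neutral": 0.6, "positive": 0.3, "negative": 0.1},
    {"neutral": 0.4, "positive": 0.5, "negative": 0.1},
    {"neutral": 0.3, "positive": 0.6, "negative": 0.1},
    {"neutral": 0.5, "positive": 0.2, "negative": 0.3},
]

per_class_auc, macro_auc, micro_auc = macro_micro_auc(toy_true, toy_score, CLASSES)
print("per-class:", per_class_auc)
print("macro:", macro_auc, "micro:", micro_auc)
```

The pairwise comparison in `binary_auc` is quadratic in the number of examples, which is fine for a small illustration; a library routine would normally be used on a corpus of several thousand tweets.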