MALAY-ENGLISH CODE-SWITCHED SOCIAL MEDIA SENTIMENT CORPUS AND SUPERVISED BENCHMARK
| First author: | |
|---|---|
| Format: | Thesis |
| Language: | English |
| Published: | Universiti Malaysia Sarawak, 2025 |
| Subjects: | |
| Online access: | http://ir.unimas.my/id/eprint/49648/ |
| Summary: | Code-switching between Malay and English is common on social media platforms such as Twitter (now called X) and YouTube, reflecting Malaysia's linguistically diverse society. However, this phenomenon presents a significant challenge for sentiment analysis, as the mixing of languages within sentences or phrases increases the likelihood of sentiment misclassification. To address this issue, this research proposed a methodology consisting of a corpus construction framework and a supervised benchmark for sentiment classification on Malay-English code-switched social media data, with the aim of improving sentiment interpretation accuracy for such mixed-language content. The first part of the methodology focused on constructing a Malay-English code-switched sentiment corpus from social media data, referred to as MESocSentiment. This corpus consists of 78.23% neutral, 16.33% positive, and 5.44% negative tweets. Data from MESocSentiment were used in a supervised machine learning approach that included a bootstrapping technique for benchmarking sentiment models. The bootstrapping technique was implemented in four rounds, with training data from the corpus added incrementally to increase the dataset size at each round. Subsequently, 6,000 tweets with sentiment labels from bootstrapping were used to train six selected machine learning and deep learning models for benchmarking. The findings showed that SVM was the best model for classifying code-switched data in terms of accuracy and AUC-ROC metrics. It achieved 76.55% accuracy, and all of its AUC values fell between 0.83 and 0.90. Its AUC values for the micro-average and macro-average ROC curves were 0.90 and 0.84, respectively. Lastly, the MESocSentiment corpus and selected experiment outputs have been published on GitHub for public use. |
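
The summary reports both micro-average and macro-average AUC values. As a minimal sketch of how these two averages differ for a three-class sentiment task, the Python below computes one-vs-rest AUC from scratch. The labels, scores, and function names (`binary_auc`, `one_vs_rest_auc`) are illustrative assumptions; none of them are taken from the thesis or from the MESocSentiment corpus.

```python
from itertools import product


def binary_auc(labels, scores):
    """AUC as the chance that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, y_score, classes):
    """Return (per-class AUC dict, macro-average AUC, micro-average AUC)."""
    per_class = {}
    pooled_labels, pooled_scores = [], []
    for k, cls in enumerate(classes):
        # Treat class `cls` as positive and every other class as negative.
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [row[k] for row in y_score]
        per_class[cls] = binary_auc(labels, scores)
        pooled_labels.extend(labels)
        pooled_scores.extend(scores)
    # Macro: unweighted mean of the per-class AUCs.
    macro = sum(per_class.values()) / len(classes)
    # Micro: one AUC over all (sample, class) decisions pooled together.
    micro = binary_auc(pooled_labels, pooled_scores)
    return per_class, macro, micro


# Hypothetical toy data: six tweets, mostly neutral, with made-up scores
# in the column order (negative, neutral, positive).
classes = ["negative", "neutral", "positive"]
y_true = ["neutral", "neutral", "neutral", "neutral", "positive", "negative"]
y_score = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.60, 0.30],
    [0.30, 0.40, 0.30],
    [0.10, 0.30, 0.60],
    [0.25, 0.50, 0.25],
]

per_class, macro, micro = one_vs_rest_auc(y_true, y_score, classes)
print(per_class)  # {'negative': 0.8, 'neutral': 0.875, 'positive': 1.0}
print(round(macro, 4), round(micro, 4))  # 0.8917 0.9097
```

The macro average weights each class equally, while the micro average pools every per-class decision, so frequent classes carry more weight. The distinction matters for a corpus as imbalanced as the one described here (78.23% neutral).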
