Data-Augmentation for Bangla-English Code-Mixed Sentiment Analysis: Enhancing Cross-Linguistic Contextual Understanding
IEEE Access
- Mohammad Tareq University of Dhaka
- Md Fokhrul Islam University of Dhaka
- Swakshar Deb University of Dhaka
- Sejuti Rahman University of Dhaka
- Abdullah Al Mahmud University of Dhaka
Abstract
In today’s digital world, automated sentiment analysis from online reviews can contribute to a wide variety of decision-making processes. One example is examining typical perceptions of a product based on customer feedback to better understand consumer expectations, which can help enhance everything from customer service to product offerings. Online review comments, on the other hand, frequently mix different languages, use non-native scripts, and do not adhere to strict grammar norms. For a low-resource language like Bangla, the lack of annotated code-mixed data makes automated sentiment analysis more challenging. To address this, we collect online reviews of different products and construct an annotated Bangla-English code-mixed (BE-CM) dataset. On our sentiment corpus, we also compare several alternative models from the existing literature. We present a simple but effective data augmentation method that can be utilized with existing word embedding algorithms, without the need for a parallel corpus, to improve cross-lingual contextual understanding. Our experimental results suggest that training word embedding models (e.g., Word2vec, FastText) with our data augmentation strategy can help the model capture the cross-lingual relationship for code-mixed sentences, thereby improving the overall performance of existing classifiers in both supervised learning and zero-shot cross-lingual adaptability. With extensive experimentation, we found that XGBoost with FastText embeddings trained using our proposed data augmentation method outperforms other alternative models in automated sentiment analysis on the code-mixed Bangla-English dataset, with a weighted F1 score of 87%.
The augmentation method is described more formally in the algorithm below. Each step is explained in more detail in the paper.
Overall Results
Below are quantitative results for sentiment analysis in comparison with previous methods.
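For readers unfamiliar with the metric cited in the abstract, the weighted F1 score is the average of the per-class F1 scores, with each class weighted by its share of the true labels. The following is a minimal sketch of that standard definition in plain Python. The function name `weighted_f1` and the toy labels are illustrative assumptions; this is not the paper's evaluation code, and the label set used in the paper is not stated on this page.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its support.
        score += (n / total) * f1
    return score


# Toy labels, invented for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg", "pos"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because each class is weighted by its support, a classifier that does well only on the majority class can still score high, which is worth keeping in mind when comparing models on imbalanced sentiment data.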
Below are t-SNE visualizations of the word vectors under three setups. Blue indicates English words, and red denotes their Bangla counterparts. (a) Word vector representation when every word is converted into its monolingual counterpart; (b) data augmentation with random word selection, as proposed in previous works; (c) our proposed data augmentation with sampling rate r = 1, 2, and 3.