Bridging the Linguistic Gap: Challenges in Building AI Models for Non-Standard Dialects
DOI: https://doi.org/10.54069/attaqwa.v21i1.978
Keywords: Non-standard Indonesian, Low-resource dialects, Code-mixing, Natural Language Processing, Language model robustness
Abstract
This study examines the challenges of developing Natural Language Processing (NLP) models for non-standard and low-resource Indonesian dialects, with a focus on code-mixing, slang, and regional variations commonly encountered in digital communication. Using a synthetic dataset (NusaDialect benchmark) for sentiment analysis and Named Entity Recognition (NER), we examined the performance of widely used models, including mBERT, IndoBERT, XLM-RoBERTa, and GPT-4. Quantitative results reveal a significant performance gap when models trained on standard Indonesian are applied to dialectal input, with IndoBERT outperforming mBERT but being surpassed by XLM-RoBERTa. In contrast, GPT-4 demonstrates strong resilience in zero-shot settings. Qualitative error analysis further reveals systematic weaknesses related to out-of-vocabulary slang, code-switching ambiguity, morphological complexity, and pragmatic or culturally embedded expressions. To address these limitations, two mitigation strategies were tested: continued pretraining on social media data and data augmentation with back-translation. Findings indicate that while continued pretraining yields the most significant performance gains, augmentation offers a more balanced trade-off by improving dialectal robustness without degrading performance on formal Indonesian. The study concludes that overcoming these linguistic challenges requires not only technical solutions but also culturally informed approaches. Practical implications extend to AI applications in customer service, social media analysis, and digital governance, where inclusivity and accessibility for diverse language users are essential.
License
Copyright (c) 2025 Rizky Surya Ramadhan, Nurul Azizah Ria Kusrini, Ardianto Ardianto

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.