The Bridging the Linguistic Gap: Challenges in Building AI Models for Non-Standard Dialects

Authors

  • Rizky Surya Ramadhan, Universitas KH. Abdul Chalim
  • Nurul Azizah Ria Kusrini, Universitas KH. Abdul Chalim
  • Ardianto Ardianto, Institut Agama Islam Daruttaqwa

DOI:

https://doi.org/10.54069/attaqwa.v21i1.978

Keywords:

Non-standard Indonesian, Low-resource dialects, Code-mixing, Natural Language Processing, Language model robustness

Abstract

This study examines the challenges of developing Natural Language Processing (NLP) models for non-standard and low-resource Indonesian dialects, focusing on the code-mixing, slang, and regional variation common in digital communication. Using a synthetic dataset (the NusaDialect benchmark) for sentiment analysis and Named Entity Recognition (NER), we evaluate widely used models, including mBERT, IndoBERT, XLM-RoBERTa, and GPT-4. Quantitative results reveal a significant performance gap when models trained on standard Indonesian are applied to dialectal input: IndoBERT outperforms mBERT but is surpassed by XLM-RoBERTa, while GPT-4 demonstrates strong resilience in zero-shot settings. Qualitative error analysis further reveals systematic weaknesses related to out-of-vocabulary slang, code-switching ambiguity, morphological complexity, and pragmatic or culturally embedded expressions. To address these limitations, we test two mitigation strategies: continued pretraining on social media data and data augmentation with back-translation. Continued pretraining yields the largest performance gains, whereas augmentation offers a more balanced trade-off, improving dialectal robustness without degrading performance on formal Indonesian. The study concludes that overcoming these linguistic challenges requires not only technical solutions but also culturally informed approaches. Practical implications extend to AI applications in customer service, social media analysis, and digital governance, where inclusivity and accessibility for diverse language users are essential.
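The performance gap described in the abstract can be illustrated with a minimal sketch. It assumes the gap is measured as the difference in macro-averaged F1 between a standard-Indonesian test set and a dialectal one; the abstract does not state which metric the study uses, so this is an assumption. The function names and the toy labels below are hypothetical and are not taken from the NusaDialect benchmark or from the study's results.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def robustness_gap(standard, dialectal):
    """Score on standard input minus score on dialectal input.

    Each argument is a (gold_labels, predicted_labels) pair.
    A larger positive value means a larger drop on dialectal text.
    """
    return macro_f1(*standard) - macro_f1(*dialectal)


# Hypothetical toy labels for a sentiment task (illustration only).
standard_eval = (["pos", "neg", "pos", "neg"], ["pos", "neg", "pos", "neg"])
dialect_eval = (["pos", "neg", "pos", "neg"], ["pos", "pos", "pos", "neg"])

gap = robustness_gap(standard_eval, dialect_eval)
print(f"macro-F1 gap: {gap:.3f}")
```

Under the same assumption, the trade-off mentioned for the two mitigation strategies would be checked by computing this gap before and after each strategy, alongside the standard-Indonesian score on its own.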

Published

2025-09-02

How to Cite

Ramadhan, R. S., Ria Kusrini, N. A., & Ardianto, A. (2025). The Bridging the Linguistic Gap: Challenges in Building AI Models for Non-Standard Dialects. Attaqwa: Jurnal Ilmu Pendidikan Islam, 21(1), 69–81. https://doi.org/10.54069/attaqwa.v21i1.978