Automation of technological and business processes

ISSN-print: 2312-3125
ISSN-online: 2312-931X
ISO: 26324:2012

LITHUANIAN HATE SPEECH CLASSIFICATION USING DEEP LEARNING METHODS


Eglė Kankevičiūtė
Milita Songailaitė
Bohdan Zhyhun
Justina Mandravickaitė

Abstract

The ever-growing volume of online content, and the opportunity for anyone to express an opinion on the Internet, leads to frequent encounters with social problems: bullying, insults, and hate speech. Some online portals take steps to stop this, for example by no longer allowing anonymous user comments or by removing the option to comment under articles, while others employ moderators who identify and remove hate speech. However, given the large number of comments, this work requires a correspondingly large number of people. The rapid development of artificial intelligence in the field of language technologies may offer a solution to this problem. Automated hate speech detection would make it possible to manage the ever-growing volume of online content; we therefore report on hate speech classification for the Lithuanian language using deep learning.

Keywords:
deep learning, transformers, hate speech, text classification


How to cite
Kankevičiūtė, E., Songailaitė, M., Zhyhun, B., & Mandravickaitė, J. (2023). LITHUANIAN HATE SPEECH CLASSIFICATION USING DEEP LEARNING METHODS. Automation of Technological and Business Processes, 15(3), 20-29. https://doi.org/10.15673/atbp.v15i3.2621
Section
TECHNICAL MEANS AND INFORMATION TECHNOLOGIES IN CONTROL SYSTEMS
