Publication: Deep learning-based Turkish spelling error detection with a multi-class false positive reduction model
| dc.contributor.author | C. Okan Sakar | |
| dc.contributor.author | Burak Aytan | |
| dc.contributor.institution | Foreign Institutions | |
| dc.contributor.institution | BAHÇEŞEHİR ÜNİVERSİTESİ | |
| dc.date.accessioned | 2025-09-20T19:55:36Z | |
| dc.date.issued | 2023 | |
| dc.date.submitted | 12.06.2023 | |
| dc.description.abstract | Spell checking and correction are important steps in the text normalization process. These tasks are more challenging in agglutinative languages such as Turkish because many words can be derived from a single root by attaching multiple suffixes. In this study, we propose a two-step deep learning-based model for misspelled word detection in Turkish. A false positive reduction model is integrated into the system to reduce the false positive predictions caused by foreign words and abbreviations, which are common on Internet sharing platforms. For this purpose, we create a multi-class dataset by developing a mobile application for labeling. We compare the effect of different tokenizer types, including character-based, syllable-based, and byte-pair encoding (BPE) approaches, in combination with Long Short-Term Memory (LSTM) and Bi-directional LSTM (Bi-LSTM) networks. The findings show that the proposed Bi-LSTM-based model with the BPE tokenizer outperforms the benchmark methods. The results also indicate that the false positive reduction step significantly increases the precision of the base detection model at the cost of a comparatively smaller drop in its recall. | |
| dc.identifier.doi | 10.55730/1300-0632.4003 | |
| dc.identifier.endpage | 595 | |
| dc.identifier.issn | 1300-0632 | |
| dc.identifier.issue | 3 | |
| dc.identifier.startpage | 581 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.14719/4561 | |
| dc.identifier.volume | 31 | |
| dc.language.iso | en | |
| dc.relation.journal | Turkish Journal of Electrical Engineering and Computer Sciences | |
| dc.subject | Computer Science | |
| dc.subject | Software Engineering | |
| dc.subject | Language and Linguistics | |
| dc.title | Deep learning-based Turkish spelling error detection with a multi-class false positive reduction model | |
| dc.type | Research Article | |
| dspace.entity.type | Publication | |
| local.indexed.at | TRDizin |
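
The two-step design described in the abstract (a base detector followed by a multi-class false positive reduction step) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the lookup sets, the function names, and the labeled sample are hypothetical stand-ins, and the paper's actual detectors are LSTM/Bi-LSTM networks over tokenized input, not dictionary lookups. The sketch only shows how filtering flagged words into "misspelled", "foreign", and "abbreviation" classes can raise precision while costing some recall.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical toy vocabularies (assumptions); the paper uses neural models.
KNOWN_WORDS = {"bugün", "hava", "çok", "güzel", "kitap"}
FOREIGN_WORDS = {"online", "meeting"}
ABBREVIATIONS = {"tbmm", "vb"}


def base_detector(word: str) -> bool:
    """Stage 1 stand-in: flag any word outside the toy vocabulary."""
    return word.lower() not in KNOWN_WORDS


def false_positive_filter(word: str) -> str:
    """Stage 2 stand-in: assign a flagged word to one of three classes."""
    w = word.lower()
    if w in FOREIGN_WORDS:
        return "foreign"
    if w in ABBREVIATIONS:
        return "abbreviation"
    return "misspelled"


def two_step_detect(word: str) -> bool:
    """Report an error only if stage 1 flags the word and stage 2 agrees."""
    return base_detector(word) and false_positive_filter(word) == "misspelled"


def precision_recall(
    predict: Callable[[str], bool],
    labeled: Iterable[Tuple[str, bool]],
) -> Tuple[float, float]:
    """Compute precision and recall of `predict` on (word, is_error) pairs."""
    tp = fp = fn = 0
    for word, is_error in labeled:
        flagged = predict(word)
        if flagged and is_error:
            tp += 1
        elif flagged and not is_error:
            fp += 1
        elif not flagged and is_error:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labeled sample: (word, is it really a spelling error?).
# "vb" is labeled as a genuine typo here so that the filter loses one hit.
SAMPLE = [
    ("bugün", False),
    ("hava", False),
    ("güzle", True),
    ("kitpa", True),
    ("online", False),
    ("TBMM", False),
    ("vb", True),
]

if __name__ == "__main__":
    print("base detector:", precision_recall(base_detector, SAMPLE))
    print("two-step     :", precision_recall(two_step_detect, SAMPLE))
```

On this toy sample the filter removes the two false alarms on the foreign word and the abbreviation, and it also suppresses one genuine error that happens to match an abbreviation, which mirrors the precision-for-recall trade-off the abstract reports.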
