Document Type: Research Paper

Author

Assistant Professor of Translation Studies, English Language Department, Hazrat-e Masoumeh University, Qom, Iran

Abstract

Adherence to the orthographic principles and conventions of the target language is a significant dimension of translation quality. This study therefore investigated how well English-to-Persian journalistic translations generated by five online machine translation (MT) systems comply with the approved orthography of the Academy of Persian Language and Literature (APLL) and with standard usage rules for punctuation marks. To this end, several texts were extracted from English newspapers and translated into Persian by five systems (ChatGPT, DeepSeek, Google Translate, Microsoft Translator, and Reverso). A word-by-word reading and analysis of the translations against the approved orthography and the punctuation rules identified 16 distinct categories of orthographic and punctuation errors in the machine-generated translations. These were then classified into three main types: (1) errors in the use of the half-space (Nim-Faslé, encoded as the zero-width non-joiner) within word elements and between elements of compound words, (2) errors in the orthography of secondary Persian characters, and (3) errors in the spacing of punctuation marks. Google Translate, Microsoft Translator, and Reverso produced instances of all 16 error categories, whereas DeepSeek and ChatGPT exhibited 4 and 12 of them, respectively. Furthermore, none of the systems performed with complete consistency: a given compound or character was at times rendered correctly and at other times incorrectly. The identified errors also closely resemble the orthographic errors prevalent in human-generated texts. This resemblance is attributable to the predominantly open-source monolingual and bilingual corpora used to train these online systems, which comprise texts authored and translated by diverse individuals across varied contexts and registers.
Given the increasing reliance on online MT systems, such orthographic and punctuation errors may gradually permeate the texts produced by Persian-speaking communities and undermine the APLL's ongoing orthographic standardization efforts. One proposed solution is the development, either by the MT providers themselves or by independent developers, of dedicated or integrated auxiliary systems that check and post-edit translations before they are delivered to the user, specifically evaluating their conformity with the approved orthographic guidelines and standard punctuation conventions.
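To make the proposed auxiliary post-editing step more concrete, the following is a minimal Python sketch of what such a checker might look like. It is not the study's method and does not implement the APLL rule set: the function name `post_edit` and the three heuristics (normalizing the Arabic letters ي and ك to Persian ی and ک, joining the verbal prefix می/نمی to the following word with a half-space, and removing the space before punctuation marks) are illustrative assumptions chosen only to show one rule per broad kind of error.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, used for the Persian half-space

# Assumption: Arabic yeh/kaf are normalized to their Persian counterparts.
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})

# Assumption: a free-standing prefix "mi"/"nemi" followed by a full space and
# an Arabic-script character should be joined to the next word with a ZWNJ.
PREFIX_SPACE = re.compile(r"(?<!\S)(\u0646?\u0645\u06cc) (?=[\u0600-\u06FF])")

# Assumption: no space is allowed before the Persian comma, semicolon, and
# question mark, or before the period, exclamation mark, and colon.
SPACE_BEFORE_PUNCT = re.compile(r"[ \t]+([\u060C\u061B\u061F.!:])")


def post_edit(text: str) -> str:
    """Apply three illustrative (hypothetical) orthographic fixes to Persian text."""
    text = text.translate(ARABIC_TO_PERSIAN)
    text = PREFIX_SPACE.sub(lambda m: m.group(1) + ZWNJ, text)
    text = SPACE_BEFORE_PUNCT.sub(r"\1", text)
    return text


if __name__ == "__main__":
    # "mi rum ." typed with a full space after the prefix and before the period
    sample = "\u0645\u06cc \u0631\u0648\u0645 ."
    print(repr(post_edit(sample)))
```

A production system would need a far larger rule inventory, an exception lexicon (for example, an orthographic dictionary), and handling of cases where such surface heuristics misfire; the sketch only shows how rule-based post-editing could sit between the MT output and the user.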
