With the rapidly growing diversity of CAT tools in this era of Machine Translation (MT) and AI development, keeping up with the latest trends can be a daunting task. During my time at the Middlebury Institute of International Studies (MIIS), however, I gained exposure to translation engine training software, various CAT tools, and the use of Regular Expressions to streamline the localization process. In this post, I’d like to walk you through a few of the projects I completed during the Spring semester of 2019.
Let’s first focus on translation engine training for a moment. Statistical Machine Translation (SMT) and, even more so, Neural Machine Translation (NMT) are hot topics in the language industry. Although their prevalence is somewhat controversial, they are nevertheless an important consideration for any stakeholder. And so, as part of our up-to-date curriculum at MIIS, we took on a small project in which we trained an SMT engine using Microsoft Custom Translator, evaluated our results, and drafted a proposal to a “client” with our recommendations for potential further training.
Our goal was to train an engine to effectively machine translate TEDTalks speeches covering marine biology and oceanography. To accomplish this, we compiled a series of “training”, “tuning”, and “testing” bitext files in an attempt to improve the BLEU score of the engine. This score is currently the industry standard. It has several disadvantages, which I won’t get into here for the sake of brevity; suffice it to say that, for the purposes of our project, an increase in the BLEU score meant a positive result for our engine.
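For readers who have not met BLEU before, here is a minimal Python sketch of the idea behind it: it compares the n-grams of a machine translation against a human reference and penalizes output that is too short. This is a simplified, sentence-level version with hypothetical function names and made-up example sentences. It is not how Microsoft Custom Translator computes its scores, and real evaluations work on a whole test corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple of words) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against a single reference (illustrative only)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # "Clipped" matches: an n-gram only counts as often as it appears in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: output shorter than the reference is scored down.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the whale swims in the deep ocean"
print(bleu("the whale swims in the deep ocean", reference))
print(bleu("the whale swims in the ocean", reference))
```

An exact match scores 1.0, and the score drops as the output diverges from the reference, which is why a rising score was treated as progress in the project.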
The “training” files were general-subject speeches pulled from a corpus of TEDTalks speeches from 2013. These gave our engine a basis from which to build its translation rules. The “tuning” files, then, would be closer in domain to our desired results. For those, we used ocean science-centric speeches. The “testing” files would be used to sample the quality of the engine’s output after the training had concluded. To use the files with our engine, we first aligned them using the online CAT tool Memsource.
After training the engine, we performed a final evaluation of the quality of the MT output. Compared to our initial model’s output, there wasn’t a significant change. Although we calculated noticeable time and cost savings for our engine over human translation, the MT quality was too underwhelming to justify further training, especially considering the history of our BLEU score (which peaked on the second of ten models). You can find our presentation and proposal in the link below.
https://drive.google.com/open?id=1CbNSlgMeW-6OlPNOKdQ3cV9j3T4Ieiou
In addition to learning to train an SMT engine, we were also introduced to an arguably indispensable tool for localization: Regular Expressions (Regex). After coming up with some custom rules for my language of study (German), I got a taste of the potential of these powerful rules. Here’s an example:
Considering the potential complexity of Regex, one of my rules is incredibly simple, yet effective. In short, “die” is an article in German that cannot appear in the Dative case (feel free to look up German cases on Wikipedia, if you dare), so the rule looks for any instance of “die” following the preposition “mit”, which always takes the Dative. This is a fast and effective way to find and correct a potential error. Another small example might be to find dates in a particular format using Regex and replace them with the correct format. For localization, Regex can be a powerful tool for detecting and correcting errors, thereby speeding up editing and proofreading. Process optimization is always a major concern for project managers, and Regex, it seems, is perfectly tailored to aid that. I have a few screenshots of some other rules I wrote linked below, for the curious.
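As a rough illustration, the two kinds of rules described above could be sketched in Python as follows. The patterns, the function names, and the choice of date formats (US-style MM/DD/YYYY converted to German DD.MM.YYYY) are assumptions made for this sketch, not the exact rules from the screenshots.

```python
import re
from typing import List

# Flags "mit" directly followed by "die". The leading \b keeps words
# such as "damit" from matching. (Assumed pattern, for illustration.)
MIT_DIE = re.compile(r"\bmit\s+die\b", re.IGNORECASE)

# Matches dates written as MM/DD/YYYY. (Assumed source format.)
US_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def find_mit_die(text: str) -> List[str]:
    """Return every 'mit die' occurrence so an editor can review it."""
    return [match.group(0) for match in MIT_DIE.finditer(text)]


def us_to_german_date(text: str) -> str:
    """Rewrite MM/DD/YYYY dates as DD.MM.YYYY, zero-padding day and month."""
    def swap(match):
        month, day, year = match.groups()
        return f"{int(day):02d}.{int(month):02d}.{year}"
    return US_DATE.sub(swap, text)


print(find_mit_die("Wir fahren mit die Bahn, damit die Kinder schlafen."))
print(us_to_german_date("Abgabe: 3/15/2019"))
```

The first function only reports matches rather than replacing them, since the correct article depends on the noun that follows. The date conversion, by contrast, is mechanical enough to automate, provided the source format really is month-first.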
https://drive.google.com/open?id=1dchG2byzepoBfDdF38v4TuMa1pGGmkfb
And, finally, I also demoed an online CAT tool called Matecat and created a video walking through it. Testing out new CAT tools is a fantastic way to broaden industry knowledge, and so I am more than happy to shed light on at least one small corner of that industry. The video is linked below in Google Drive.
If you have any questions or suggestions, please feel free to send me a message via my Contact page.