In this semester's Adv. Computer Assisted Technology (CAT) course, which builds on Intro to CAT, I gained a deeper understanding of different types of productivity software, pseudo-translation of WordPress posts, utility tools, and regex (regular expressions). For the first half of the semester, we mainly focused on training a neural statistical machine translation engine.

1. Collecting Data

We worked with Microsoft Custom Translator and chose UN financing reports as our training, testing and tuning data, for two reasons. First, a large amount of parallel text for these reports is easy to find in the UN parallel corpus on OPUS, and machine training needs a huge amount of data to learn the patterns. Second, the language of these reports is very controlled and repetitive, and they are all written from the same template. These are primary factors to consider when choosing data for machine translation.

2. Data Cleaning

After we had gathered our training, testing and tuning data from OPUS, we used YouAlign and a TMX editor to clean up the data in order to achieve a better result. The TMX editor we used is Heartsome TMX Editor, which has a feature that lets us delete untranslated sentences and repetitions.

Olifant is another very useful tool for TMX cleaning.
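The same kind of cleanup can also be scripted. Below is a minimal sketch, not what Heartsome or Olifant do internally: it reads segment pairs from a TMX string and drops empty targets, untranslated segments (target identical to source) and exact repetitions. The function names, the `SAMPLE_TMX` data and the `en`/`fr` language codes are all assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample data; the language pair is an assumption.
SAMPLE_TMX = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en"><seg>Financing of the mission</seg></tuv>
    <tuv xml:lang="fr"><seg>Financement de la mission</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Financing of the mission</seg></tuv>
    <tuv xml:lang="fr"><seg>Financement de la mission</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>A/72/789</seg></tuv>
    <tuv xml:lang="fr"><seg>A/72/789</seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Report of the Secretary-General</seg></tuv>
    <tuv xml:lang="fr"><seg></seg></tuv></tu>
<tu><tuv xml:lang="en"><seg>Budget performance</seg></tuv>
    <tuv xml:lang="fr"><seg>Exécution du budget</seg></tuv></tu>
</body></tmx>"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) text pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text inside inline tags.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs


def clean_pairs(pairs):
    """Drop empty, untranslated and repeated pairs, keeping first occurrences."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        if not src or not tgt:
            continue  # one side is missing
        if src.casefold() == tgt.casefold():
            continue  # target is a copy of the source, i.e. untranslated
        if (src, tgt) in seen:
            continue  # exact repetition
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


if __name__ == "__main__":
    for src, tgt in clean_pairs(read_tmx_pairs(SAMPLE_TMX, "en", "fr")):
        print(src, "|", tgt)
```

Note that the "untranslated" check is deliberately blunt: it also removes segments such as document symbols that are legitimately identical in both languages, so it is worth reviewing what gets dropped.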

3. Training

With the cleaned-up data, our training went pretty well. As we added more cleaned sentences, we got a relatively high BLEU score.
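For readers who want to see what a BLEU score measures, here is a simplified sketch of the calculation: clipped n-gram precision for 1- to 4-grams, combined by a geometric mean and multiplied by a brevity penalty. It assumes one reference per sentence and plain whitespace tokenization, so it will not reproduce the exact numbers that Custom Translator reports. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # n-grams in the machine output per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    refs = ["the general assembly approved the budget for the mission"]
    hyps = ["the general assembly approved the mission budget"]
    print(round(corpus_bleu(hyps, refs), 2))
```

The score rises as the machine output shares more and longer word sequences with the human reference, which is why repetitive, template-based text like our reports tends to score well.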

4. QA

We carried out two rounds of QA, one in the middle of training and the other after all the training was done. The machine did a pretty good job on accuracy and fluency. We used the LISA metric for evaluation.

Please use the links below to see our original proposal, the proposal we updated based on the results of our 14 rounds of training, and our presentation of lessons learned.

Original Proposal

https://drive.google.com/open?id=1KbYhhz7NKuwi9m04dvqTR6rwp0O1mTq3oyaR4tgfc3g

Updated Proposal

https://drive.google.com/drive/u/1/folders/1D-nnO6rEgMebattUMU_mB2yAQIjuq0If

Presentation of Lessons Learned

https://drive.google.com/open?id=1elKC5PCRzn3899wg055WWQuaO3Kq37g0