Pilot project: Machine Translation Training

Introduction:

This portfolio entry covers a group project we completed in the Advanced CAT Tool class. Our team, Viral Girls, set out to train a machine translation (MT) engine to translate better. We chose medicine as our domain and epidemics as the topic of our source materials, since we are now haunted by COVID-19 and are exposed to all kinds of information about it every day. We wanted English news, papers and other sources of information about the disease to be translated into Simplified Chinese with good quality and to reach readers across the ocean as quickly as possible. For this pilot project, we first drew up a proposal that predicted how training would improve the efficiency and quality of the machine's translations and that set a QA metric model before the project began. During the project, we ran 10 rounds of training. Afterwards, we compared the BLEU scores of each round, as well as the QA scores given by our post-editor, to see whether quality had improved. Finally, we updated our proposal with suggestions for further improving the machine translation and a list of the lessons we learnt.

We prepared a proposal and presented it to our “client” at the kick-off meeting. After we got approval, we went through the workflow listed below:

1. Data Collection & File Preparation: 

  • Find relevant original Mandarin Chinese documents that have English translations. 
  • Convert data to the correct format if needed. 
  • Use auto-alignment for the training data and manually align the documents for the tuning data, producing .tmx files in Trados. 
  • Apply post-editing during this process where appropriate.

2. Data Cleaning: Use Olifant to clean the collected data and ensure the basic quality of the aligned files.

3. Data Training: Separate the data by purpose (training, tuning and testing), then run 10 rounds of training in MS Custom Translator.

4. Data Reviewing: After each round of training, check the BLEU score for improvement. If there is no visible improvement, collect more and better data and redo the training.

5. Data Assessment: Based on the training results, further assess the estimated time and cost of post-editing.
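
To make steps 2 to 4 more concrete, here is a minimal Python sketch of the same ideas. It is not what we actually ran: in the project, cleaning was done in Olifant and the BLEU scores came from MS Custom Translator. The function names (clean_pairs, split_pairs, stalled_rounds), the cleaning rules and the min_gain threshold are all hypothetical and only illustrate the logic, assuming the aligned segments are already available as (source, target) pairs.

```python
import random


def clean_pairs(pairs):
    """Drop empty, untranslated (source == target) and duplicate pairs.

    Hypothetical cleaning rules; Olifant offers its own filters.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue  # one side of the segment is empty
        if source == target:
            continue  # segment was probably left untranslated
        if (source, target) in seen:
            continue  # exact duplicate of an earlier pair
        seen.add((source, target))
        cleaned.append((source, target))
    return cleaned


def split_pairs(pairs, tuning_size, testing_size, seed=0):
    """Shuffle reproducibly, then carve out tuning and testing sets.

    Everything that is left over becomes training data.
    """
    if tuning_size + testing_size > len(pairs):
        raise ValueError("not enough pairs for the requested split")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    tuning = shuffled[:tuning_size]
    testing = shuffled[tuning_size:tuning_size + testing_size]
    training = shuffled[tuning_size + testing_size:]
    return training, tuning, testing


def stalled_rounds(bleu_scores, min_gain=0.5):
    """Return the 1-based rounds whose BLEU gain over the previous
    round is below min_gain (an assumed threshold, not a standard)."""
    return [
        i + 1
        for i in range(1, len(bleu_scores))
        if bleu_scores[i] - bleu_scores[i - 1] < min_gain
    ]
```

In this sketch, a round returned by stalled_rounds would be the signal to go back to step 1 and collect more and better data before training again.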

Project Files:

01 Proposal

02 Updated Proposal

03 Conclusion presentation of the project

What I have learnt from this:

  • MT improved significantly as the amount of testing and tuning data increased
  • Including diverse, new and manually aligned data improved the quality of the MT training
    • Diverse, in this context, meant high-quality, informative news articles and official documents
    • Manual alignment meant more human attention was paid to keeping segments consistent (i.e. more human involvement = better MT output)
    • Introducing new data meant that the MT had more opportunities to learn better output patterns