When Computer-Assisted Translation (CAT) Tools Meet Regex

Regular expressions (regex) are one of the most important tools for localization quality assurance in Computer-Assisted Translation (CAT) tools. They help translators catch localization errors such as wrong date and time formats. First the user defines rules that describe an error, then the program searches for matches and flags them for manual checking. Finally, the translator reviews each flag and makes the change.
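The rule-then-flag workflow can be sketched in a few lines of Python. This is only an illustration of the idea, not how Trados or memoQ implement their QA checks; the rule names, the Flag type, and the run_checks function are all hypothetical.

```python
import re
from typing import NamedTuple


class Flag(NamedTuple):
    segment_id: int
    rule: str
    text: str


# Hypothetical rules for a Chinese target text. Each rule describes
# something that should NOT appear in the translation.
RULES = {
    "English period": re.compile(r"\.(?!\d)"),
    "US-style date": re.compile(r"(?<!\d)[01]?\d-[0123]?\d-[12]\d{3}(?!\d)"),
}


def run_checks(targets):
    """Return one Flag per (segment, rule) pair that matches."""
    flags = []
    for segment_id, text in enumerate(targets, start=1):
        for name, pattern in RULES.items():
            if pattern.search(text):
                flags.append(Flag(segment_id, name, text))
    return flags


targets = ["你好。", "会议在03-15-2021举行。", "谢谢."]
for flag in run_checks(targets):
    print(flag.segment_id, flag.rule, flag.text)
```

The program only flags segments. The decision about what to change stays with the translator.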

Punctuation

Chinese punctuation marks are almost the same as the English ones. However, the Chinese period is a full-width character, about twice as wide as the half-width English period, and it is a small circle rather than a dot.

ZH-CN: (。)

EN-US: (.)

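In Trados, we used the Find pattern \.\s* and the Replace text 。 to replace all English periods with Chinese periods. The backslash makes the period literal (an unescaped . matches any character), and \s* also removes the space after an English period, which Chinese text does not need.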
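The same replacement can be tried outside Trados. Below is a minimal Python sketch, assuming the text is available as a plain string. The function name to_chinese_periods is hypothetical, and the (?!\d) lookahead is an addition of this sketch (not part of the Trados rule) so that decimal points such as 3.5 are left alone.

```python
import re

# An escaped period that is not followed by a digit, plus any
# whitespace after it. The lookahead protects decimals like "3.5".
ENGLISH_PERIOD = re.compile(r"\.(?!\d)\s*")


def to_chinese_periods(text: str) -> str:
    """Replace English periods (and trailing spaces) with the Chinese period."""
    return ENGLISH_PERIOD.sub("。", text)


print(to_chinese_periods("价格是3.5元. 谢谢."))
```

A blanket replacement like this still deserves a manual pass, because periods in URLs, file names, or abbreviations would be converted too.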

Addresses

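Chinese addresses are written in the opposite order from American addresses, from the largest unit down to the smallest. Many Chinese addresses also contain the characters 中国, 市, and 房, because they begin with “China,” name a city, and often end with “Room.”

This makes it fairly easy to write a regex that searches for addresses in Chinese. The words must be written as literal text: a bracketed class such as [中国] would match only a single character, either 中 or 国.

中国.+市.+\d+房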

After flagging the Chinese address, the translator can change the format to an international address format.
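As a rough illustration, the address search can be run in Python. The pattern below spells 中国, 市, and 房 as literal text, and the function name and sample address are made up for this sketch.

```python
import re

# "China", then a city marker, then a number followed by "Room".
CHINESE_ADDRESS = re.compile(r"中国.+市.+\d+房")


def has_chinese_address(segment: str) -> bool:
    """Return True when the segment looks like it contains a Chinese address."""
    return CHINESE_ADDRESS.search(segment) is not None


print(has_chinese_address("地址：中国上海市南京路100号5房"))
print(has_chinese_address("中国很大"))
```

The pattern is deliberately loose, so it will miss addresses that omit 中国 or 房. It is a way to flag candidates, not a validator.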

Telephone

American phone numbers are normally written for American readers without the +1 country code. Chinese readers need to dial +1 first, so the translation should include it.

The source pattern finds American phone numbers. The target pattern checks whether the translated number carries +1, the United States country code. If it does not, the segment is flagged. When translating Chinese numbers into English, the code should be +86 instead.

Source: \(\d{3}\)\s?\d{3}-\d{4}

Target: \+1\s?\(\d{3}\)\s?\d{3}-\d{4}

The translator should also check the number formatting. Chinese phone numbers usually do not have dashes or parentheses.
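The source-and-target phone check can be sketched in Python as follows. The function name phone_needs_review and the sample number are hypothetical, and the patterns allow an optional space after the area code, which is an assumption about how the source text is formatted.

```python
import re

# American number such as "(831) 555-0100" in the source.
SOURCE_PHONE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}")
# The same number with the +1 country code in the target.
TARGET_PHONE = re.compile(r"\+1\s?\(\d{3}\)\s?\d{3}-\d{4}")


def phone_needs_review(source: str, target: str) -> bool:
    """Flag the segment when the source has a US number but the target lacks +1."""
    if SOURCE_PHONE.search(source) is None:
        return False
    return TARGET_PHONE.search(target) is None


print(phone_needs_review("Call (831) 555-0100.", "请拨打 (831) 555-0100。"))
print(phone_needs_review("Call (831) 555-0100.", "请拨打 +1 (831) 555-0100。"))
```

The first call flags the segment because the country code is missing. The second passes.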

Date

The date format in the United States is MM-DD-YYYY. In China, the standard is YYYY-MM-DD, sometimes with the Chinese words for “year,” “month,” and “day” (年, 月, 日).

The source check is [01]?\d-[0123]?\d-[12]\d{3}, and the target check is \d{4}年[01]?\d月[0123]?\d日.
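To show how the two date checks can work together, here is a small Python sketch that compares the dates found in the source with those found in the target. The function name missing_dates is hypothetical, capture groups are added so the numbers can be compared, and a four-digit year with one- or two-digit month and day is assumed.

```python
import re

# MM-DD-YYYY in the English source.
SOURCE_DATE = re.compile(r"(?<!\d)([01]?\d)-([0123]?\d)-([12]\d{3})(?!\d)")
# YYYY年MM月DD日 in the Chinese target.
TARGET_DATE = re.compile(r"(\d{4})年([01]?\d)月([0123]?\d)日")


def missing_dates(source: str, target: str):
    """Return (year, month, day) tuples present in the source but not the target."""
    expected = {(int(y), int(m), int(d)) for m, d, y in SOURCE_DATE.findall(source)}
    found = {(int(y), int(m), int(d)) for y, m, d in TARGET_DATE.findall(target)}
    return sorted(expected - found)


print(missing_dates("Due 03-15-2021.", "截止日期为2021年3月15日。"))
print(missing_dates("Due 03-15-2021.", "截止日期为2021年5月3日。"))
```

An empty list means every source date was found in the target. Anything else is a date the translator should look at.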

Class and Grade

When talking about students, English usually gives the student’s class first and then the grade. Chinese does the opposite and gives the grade first. The check therefore has to confirm not only that the Chinese words are present but also that they appear in the right sequence.

Source: Class\s\d+,?\s+Grade\s\d+

The target is simpler because there are no spaces between Chinese words.

Target: \d+年\d+班
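The sequence check can be sketched in Python by capturing the numbers and comparing them in swapped order. The function name class_grade_ok and the sample sentences are invented for this example, and the sketch assumes Arabic digits on both sides.

```python
import re

# English order: class first, then grade.
SOURCE_CLASS = re.compile(r"Class\s(\d+),?\s+Grade\s(\d+)")
# Chinese order: grade (年) first, then class (班).
TARGET_CLASS = re.compile(r"(\d+)年(\d+)班")


def class_grade_ok(source: str, target: str) -> bool:
    """Return True when the target gives grade then class with the source's numbers."""
    src = SOURCE_CLASS.search(source)
    if src is None:
        return True  # nothing to check in this segment
    class_no, grade_no = src.groups()
    tgt = TARGET_CLASS.search(target)
    return tgt is not None and tgt.groups() == (grade_no, class_no)


print(class_grade_ok("She is in Class 2, Grade 3.", "她在3年2班。"))
print(class_grade_ok("She is in Class 2, Grade 3.", "她在2年3班。"))
```

The second call fails because the numbers were copied in English order, which is exactly the mistake this check is meant to catch.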


CAT Team Project Files

Please feel free to click the links below.

The Introduction to Computer-Assisted Translation course

The Introduction to Computer-Assisted Translation course is a general course for gaining an overview of different CAT tools, like memoQ and SDL Trados Studio. The course prepares students for completing localization projects from start to finish.

