Using Regex to Examine Conversion between Traditional and Simplified Chinese

In college I was assigned the political philosophy of Machiavelli, Locke, Hobbes, and Rousseau in Simplified Chinese. Honestly, understanding the “translation-ese” was more of a challenge than deciphering the Simplified Chinese characters.

Are Traditional and Simplified Chinese different enough to justify full payment for two locales? Not always, at least for some clients. Some LSPs offer a discount when both Traditional and Simplified Chinese are involved; they may charge only 50% for one locale when the project also covers the other. After all, the relationship between Traditional and Simplified Chinese is similar to that between American and British English: if you can read one, you can read the other.

If there is a text available in Simplified Chinese, LSPs can use software to convert it into Traditional Chinese. The text certainly will need some, occasionally much, editing, but you don’t need to translate it again.

If the nature of the project does not warrant full payment for two locales, clients may decide to use software to convert one Chinese locale into the other and then ask LSPs to edit the material.

The editing, however, is not necessarily straightforward. I’m not even going to talk about the differences in common nouns between Traditional and Simplified Chinese, or their different resistance to Anglicized grammar. In this post I focus only on “characters.” Take a look at this example:

I used a Chrome extension called “Simplify-Traditional Chinese Characters Conversion” to convert 武汉热干面 (Wuhan Spicy Noodles, Simplified Chinese) to its counterpart in Traditional Chinese. The result, 武漢熱幹面, means “Wuhan Fxck Face.” In this case, 干 can be converted to either 乾 or 幹, and 面 needs to be converted to 麵, so the proper form would be 武漢熱乾麵.

The extension wasn’t able to address these issues. The problem is that one character in Simplified Chinese can correspond to several characters in Traditional Chinese (干 can be 乾 or 幹), and sometimes a character that exists in Simplified Chinese also exists in Traditional Chinese, albeit with a different meaning (面 means “face” in Traditional Chinese, while 麵 means “noodles”).
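
To make the failure concrete, here is a minimal Python sketch of a naive character-by-character converter. The mapping table is a tiny hand-written assumption for illustration only, not the extension's actual data or logic; the converter simply takes the first Traditional candidate every time, which is one plausible way to end up with the wrong result above.

```python
# Hypothetical, hand-written mapping: each Simplified character lists its
# possible Traditional counterparts. Real converters use far larger tables.
CANDIDATES = {
    "武": ["武"],
    "汉": ["漢"],
    "热": ["熱"],
    "干": ["幹", "乾", "干"],  # one Simplified character, several Traditional ones
    "面": ["面", "麵"],        # 面 also exists in Traditional, with a narrower meaning
}


def naive_convert(text: str) -> str:
    """Convert character by character, always taking the first candidate."""
    return "".join(CANDIDATES.get(ch, [ch])[0] for ch in text)


def ambiguous_chars(text: str) -> list[str]:
    """Return the characters in text that have more than one candidate."""
    return [ch for ch in text if len(CANDIDATES.get(ch, [ch])) > 1]


print(naive_convert("武汉热干面"))    # 武漢熱幹面 (the wrong result from the example)
print(ambiguous_chars("武汉热干面"))  # ['干', '面']
```

The second function is the interesting one for editors: it does not try to pick the right character, it only points at the places where a choice had to be made.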

We can definitely find ways, such as term bases or rules, to improve the software that converts between Simplified and Traditional Chinese. But the purpose of this post is just to show how Regex can help human editors with conversion between the two locales.

I looked up a list of characters in Simplified Chinese that have multiple corresponding characters in Traditional Chinese, then I used the Regex alternation operator “|” to compile them into a single pattern: 划|卤|历|发|只|台|后|坛|复|尽|干|并|当|志|汇|系|脏|荡|获|采|里|钟|饥|丰|丑|了|借|克|准|刮|制|吁|吊|团|困|布|御|斗|曲|松|淀|纤|致|蔑|仇|冬|咸|云|仆|舍|签|折|谷|几|辟|奸|游|佣|苏|回|面|向|伙|郁|朴|才|朱|别|卷|蒙|征|症|恶|注|哄|参|腌|彩|占|欲|扎|熏|赞|尝|烟|周|柜|喂|幸|凶|杰|针|戚|托|挨|挽|栗|炼|链|穗|雕|梁|升|摆|岩|娘|僵|药|余|蜡|出|卜|同|漓|术|仑|秋|千|帘|庵|尸|胡|须|据|筑|夸|苹|袅|暗|冲|表|杆|鉴|搜|杯|铲|扣|念|杠|泛|核|巨|叹|价|私|局|拐|弦|哗|凄|家|席|酸|噪|咽|愈|凌|毁|苔|糊|抵|恤|荫|皂|芸|背|夫|迹
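
For readers who want to try this outside a CAT tool or text editor, here is a minimal Python sketch of building such a pattern from a list. It uses only a short subset of the characters above to stay readable, and the variable names are my own. Since every alternative is a single character, a character class such as [划卤历] would match exactly the same text.

```python
import re

# A short subset of the list above, for illustration only.
AMBIGUOUS = ["划", "卤", "历", "发", "只", "台", "后", "干", "面", "里"]

# Join the entries with the alternation operator "|". re.escape is a
# precaution in case a future entry is a character that is special in Regex.
pattern = re.compile("|".join(re.escape(ch) for ch in AMBIGUOUS))

# Every match is a character that has more than one Traditional counterpart.
print(pattern.findall("武汉热干面"))  # ['干', '面']
```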

If an LSP has used software to convert Simplified Chinese into Traditional Chinese, they can run this pattern over the Simplified source to flag the characters that editors or proofreaders need to pay attention to, as demonstrated in the following picture:
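
The same kind of flagging can be approximated in a few lines of Python. This is only a sketch under my own assumptions: the pattern is a small subset of the full list, and the 【】 brackets are an arbitrary choice of marker, not something any particular tool produces.

```python
import re

# Subset of the full pattern; in practice you would use the whole list.
PATTERN = re.compile("干|面|发|后|里")


def highlight(text: str) -> str:
    """Wrap every flagged character in 【】 so it stands out to a proofreader."""
    return PATTERN.sub(lambda m: "【" + m.group(0) + "】", text)


def flag_positions(text: str) -> list[tuple[int, str]]:
    """Return (index, character) pairs for every flagged character."""
    return [(m.start(), m.group(0)) for m in PATTERN.finditer(text)]


print(highlight("武汉热干面"))       # 武汉热【干】【面】
print(flag_positions("武汉热干面"))  # [(3, '干'), (4, '面')]
```

The positions are zero-based character offsets, which makes it easy to line a flagged character up with the same spot in the converted text, since conversion is one character to one character.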

A next step would be to make a list of characters in Traditional Chinese that software often converts incorrectly, and turn it into a Regex pattern to help editors and proofreaders spot mistakes. The list need not be exhaustive; catching the most common mistakes would already be valuable.
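
One possible starting point for that list, sketched in Python: take a one-to-many mapping and treat every Traditional candidate as a suspect. The mapping below is a small hypothetical sample, and this approach flags every candidate rather than only the ones software gets wrong most often, so it is a rough first pass and not the curated list described above.

```python
import re

# Hypothetical sample of one-to-many mappings (Simplified -> Traditional).
ONE_TO_MANY = {
    "干": ["乾", "幹", "干"],
    "面": ["面", "麵"],
    "发": ["發", "髮"],
    "后": ["后", "後"],
}

# Every Traditional candidate is a place where the software may have guessed wrong.
suspects = sorted({t for options in ONE_TO_MANY.values() for t in options})
traditional_pattern = re.compile("[" + "".join(suspects) + "]")


def review_hints(converted: str) -> list[tuple[str, list[str]]]:
    """For each suspect character in converted text, list the alternatives."""
    hints = []
    for m in traditional_pattern.finditer(converted):
        ch = m.group(0)
        for options in ONE_TO_MANY.values():
            if ch in options:
                hints.append((ch, [o for o in options if o != ch]))
    return hints


print(review_hints("武漢熱幹面"))  # [('幹', ['乾', '干']), ('面', ['麵'])]
```

Showing the alternatives next to each flagged character saves the proofreader a lookup: for 幹 the hint already suggests 乾, which is the correct choice in 武漢熱乾麵.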

Again, I do not recommend that clients always use software to convert between Simplified and Traditional Chinese: certain products are very sensitive to locale, and for those it is important to adapt the text properly to each one. That said, in other cases clients can save time and money with conversion software and use Regex patterns like this one to facilitate the proofreading process.

Fun Fact

The movie Snowden (2016) uses the same Regex operator, “|”, to illustrate how NSA agents search for potential terrorists.