Pinyin Tone converter – Google Apps Script custom functions

I’ve never been shy of writing up something I don’t know much about, and today’s post is no different. A few weeks ago I’d never heard of Pinyin; today, here’s a couple of Google Apps Script custom functions to convert pinyin tone numbers to tone marks and back again. Because of my greenness in this area, I’d welcome any suggestions or corrections to the likely omissions and inaccuracies in this post.

There are two people to thank for the background for these functions –  Arnold Sakhnov, whose jQuery plugin for pinyin gave me the idea for mark placement in vowel clusters and +Marjolein Hoekstra whose project and test data this is.

You can find the functions and example data shared in this Google Spreadsheet.

What is PinYin

Pinyin is a system for representing Chinese phonetically, and is used to romanize chinese script for publication or learning, and in particular for entry with a standard keyboard. An example of pinyin is given below. You will notice that many of the vowels are marked with ‘diacritics’. These are markers used to indicate the ‘tone’ to apply to the syllable

Lǎoshī gāngcái yào wǒmen xiě yí ge “jǐ” zì. Xiǎo Lǐ běnlái yǐjing xiě duìle. Kěshì tā xiěwán yǐhòu, yòu zài “jǐ” zì de zuǒbian jiāle yí ge “mù” zì, chéngle “jī” zì. ,Zhēn shì huàshétiānzú ǎ!

Coloring

There are coloring systems used to make the tones of the syllables more obvious. Here is the same thing, using a default coloring


Lǎoshī gāngcái yào men xiě ge. Xiǎo běnlái jing xiě duìle. shì xiěwán hòu, yòu zài de zuǒbian jiāle ge, chéngle. ,Zhēn shì huàshétiān ǎ!

Tone numbers

Entering the tone markers on a keyboard would be a challenge, so another method is to use tone numbers, 1-4, each of which represent one of the possible tones of mandarin. Number 5 is sometimes used to indicate no tone.

Here’s the same phrase, this time using tone numbers in place of tone marks

Lao3shi1 gang1cai2 yao4 wo3men xie3 yi2 ge “ji3” zi4. Xiao3 Li3 ben3lai2 yi3jing xie3 dui4le. Ke3shi4 ta1 xie3wan2 yi3hou4, you4 zai4 “ji3” zi4 de zuo3bian jia1le yi2 ge “mu4” zi4, cheng2le “ji1” zi4. ,Zhen1 shi4 hua4she2tian1zu2 a3!

Initials and finals

The basic components of pinyin are ‘initials’ and ‘finals’ which form clusters of letters which correspond to a Mandarin syllable. The tone mark or number is used to guide the pronunciation of the syllable.

PINYIN functions

The Google Apps Script custom functions are

pinyinToToneNumbers()
pinyinToToneMarks()

These can be called by other scripts or used either as a custom function in a spreadsheet – for example
=pinyinToToneMarks(A2)

Additionally, these can generate html coloring code (either using a default scheme or one you specify), and because of how they have been implemented, they will also check for valid pinyin.

Parameters

Here are the usages for the given functions. Note that they can be used as array functions or as single cell functions. For example, to translate a whole column you would use =pinyinToToneMarks(A2:A)

/**
 * accents a pinyin string with optional html colorizing
 * @param {string|<array>.string} input the string to be converted
 * @param {number=} optColoringType - 0 = no coloring(default), 1 = return html with coloring options given
 * @param {string=} optColors - a comma separated string of colors to use for tones 1-5  (5=no tone)- a default will be used if omitted
 * return {<array>.string} the converted string (including colorization if required)
 */  
function pinyinToToneMarks (input, optColoringType, optColors) {
  return pinyin(input, optColoringType, optColors, false);
}


/**
 * replaces accents in pinyin string with tone numbers and adds optional html colorizing
 * @param {string|<array>.string} input the string to be converted
 * @param {number} optColoringType - 0 = no coloring(default), 1 = return html with coloring options given
 * @param {string} optColors - a comma separated string of colors to use for tones 1-5  (5=no tone)- a default will be used if omitted
 * @param {boolean} optUseV - translate to 'v' rather than ü, default is ü
 * return {<array>.string} the converted string (including colorization if required)
 */  
function pinyinToToneNumbers(input, optColoringType, optColors) {
  return pinyin(input, optColoringType, optColors, true);
}

Implementation notes

I discovered that the rules around syllables, means that there is a limited set of combinations that can make up a syllable. The validation, chopping up of words into syllables and subsequent tone conversions becomes a simple problem of figuring out an appropriate regex. I came up with this one (which may not cover all cases yet, but is working with all examples so far).

‘huan(?=gu|ge)|yuan(?=gu|ge)|jian(?=gu|ge)|’ +
    ‘zhuang|chuang|shuang|niang|liang|guang|kuang|huang|zhang|zheng|zhong|zhuai|zhuan|’ + 
    ‘chang|cheng|chong|chuai|chuan|shang|sheng|shuai|shuan|jiang|jiong|qiang|qiong|xiang|xiong|’ + 
    ‘fan(?=gu|ge)|gan(?=gu|ge)|wan(?=gu|ge)|kan(?=gu|ge)|jin(?=gu|ge)|’ +
    ‘uang|ueng|iang|iong|bang|beng|biao|bian|bing|pang|peng|piao|pian|ping|mang|meng|miao|mian|’ + 
    ‘ming|fang|feng|dang|deng|dong|duan|diao|dian|ding|tang|teng|tong|tuan|tiao|tian|ting|nang|’ + 
    ‘neng|nong|nuan|niao|nian|ning|lang|leng|long|luan|liao|lian|ling|gang|geng|gong|guai|guan|’ + 
    ‘kang|keng|kong|kuai|kuan|hang|heng|hong|huai|huan|zang|zeng|zong|zuan|cang|ceng|cong|cuan|’ + 
    ‘sang|seng|song|suan|zhai|zhei|zhao|zhou|zhan|zhen|zhua|zhuo|zhui|zhun|chai|chao|chou|chan|’ + 
    ‘chen|chua|chuo|chui|chun|shai|shei|shao|shou|shan|shen|shua|shuo|shui|shun|rang|reng|rong|’ + 
    ‘ruan|jiao|jian|jing|juan|qiao|qian|qing|quan|xiao|xian|xing|xuan|wang|weng|yang|ying|yong|’ + 
    ‘yuan|’ +
    ‘qu(?=n)|ke(?=n)|’ + 
    ‘ang|eng|ong|uai|uan|iao|ian|ing|üan|bai|bei|bao|ban|ben|bie|bin|pai|pei|pao|pou|pan|’ + 
    ‘pen|pie|pin|mai|mei|mao|mou|man|men|mie|miu|min|fan|fei|fou|fen|gan|dai|dei|dao|dou|dan|den|’ + 
    ‘duo|dui|dun|dia|die|diu|tai|tei|tao|tou|tan|tuo|tui|tun|tie|nai|nei|nao|nou|nan|nen|nuo|nun|’ + 
    ‘nie|niu|nin|nüe|lai|lei|lao|lou|lan|luo|lun|lia|lie|liu|lin|lüe|gai|gei|gao|gou|gen|gua|’ + 
    ‘guo|gui|gun|kai|kei|kao|kou|kan|ken|kua|kuo|kui|kun|hai|hei|hao|hou|han|hen|hua|huo|hui|hun|’ + 
    ‘zai|zei|zao|zou|zan|zen|zuo|zui|zun|cai|cao|cou|can|cen|cuo|cui|cun|sai|sao|sou|san|sen|suo|’ + 
    ‘sui|sun|zha|zhe|zhu|zhi|cha|che|chu|chi|sha|she|shu|shi|rao|rou|ran|ren|rua|ruo|rui|run|jia|’ + 
    ‘jie|jiu|jin|jue|jun|qia|qie|qiu|qin|que|qun|xia|xie|xiu|xin|xue|xun|ang|eng|wai|wei|wan|wen|’ + 
    ‘yao|you|yan|yin|yue|yun|ai|ei|ao|ou|an|en|ua|uo|ui|un|ia|ie|iu|in|üe|ün|ba|bo|bu|bi|pa|po|pu|’ + 
    ‘pi|ma|mo|me|mu|mi|fa|fo|fu|da|de|du|di|ta|te|tu|ti|na|ne|nu|ni|nü|la|le|lu|li|lü|ga|ge|gu|ka|’ + 
    ‘ke|ku|ha|he|hu|za|ze|zu|zi|ca|ce|cu|ci|sa|se|su|si|re|ru|ri|ji|ju|qi|qu|xi|xu|ai|ei|ao|ou|an|’ + 
    ‘en|wu|wa|wo|yi|ya|ye|yu|er|a|o|e|u|i|ü|a|o|e’

For more stuff like this, please visit Excel Liberation. If there is any interest in this topic, I may create an Excel version.

You can find the functions and example data shared in this Google Spreadsheet.

About brucemcp 225 Articles
I am a Google Developer Expert and decided to investigate Google Apps Script in my spare time. The more I investigated the more content I created so this site is extremely rich. Now, in 2019, a lot of things have disappeared or don’t work anymore due to Google having retired some stuff. I am however leaving things as is and where I came across some deprecated stuff, I have indicated it. I decided to write a book about it and to also create videos to teach developers who want to learn Google Apps Script. If you find the material contained in this site useful, you can support me by buying my books and or videos.