Unicode and Pocket PC

As a multilingual person, I am interested in how the computer systems I use can support multiple languages. One of the keys to multiple-language support is Unicode. Here is some information that I found during my investigation into Unicode.

1. What is Unicode? Why Unicode?

Unicode is basically a way to encode characters.

Digital computers handle only bits (on-off). Therefore, to represent information like text, there must be some way to encode the text into bits. In the 60s, before microcomputers became common, the American Standard Code for Information Interchange (ASCII) was developed for use in communications equipment. ASCII uses 7 bits to encode up to 128 different characters. It encodes all 26 letters of the English alphabet, punctuation marks and some control codes, so it seemed logical to use the same encoding scheme on microcomputer systems. However, the IBM PC started as an 8-bit system, so when ASCII is used there is an extra bit available, which can be used to encode an additional 128 characters. These additional characters do not belong to ASCII and are not standardised; in fact, even the non-printing control codes of ASCII are often replaced. As a result, this encoding, although compatible with ASCII for displaying text, is not the same as ASCII.
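
A minimal Python sketch of my own to illustrate this (the byte value 0xE9 is just an arbitrary example of an "extra bit" value):

    # 'A' is ASCII code 65; it fits comfortably in 7 bits:
    print(ord('A'), format(ord('A'), '07b'))   # 65 1000001

    # The eighth bit allows byte values past 127, outside ASCII proper:
    print(bytes([0xE9]).decode('latin-1'))     # 'é' under one common extension
    # bytes([0xE9]).decode('ascii') would raise UnicodeDecodeError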

The original encoding scheme used in computers does not include characters used in Western European languages, like 'é' or 'ö'. The American National Standards Institute (ANSI) later adopted a character set that includes those characters as well; it is commonly known as the ANSI character set, or ISO-8859-1. For other languages like Greek or Russian that do not use the Latin alphabet, an 8-bit encoding scheme is also used. This means that the same 8-bit pattern could represent different characters depending on which script (Latin, Cyrillic, Greek, etc.) the pattern is supposed to encode. Thus the codepage was introduced: a codepage tells the system which script an encoding is supposed to represent. Still other languages, like Chinese or Japanese, have more characters than can be encoded with 8 bits. There are two ways to deal with this. One is to use a double-byte (16-bit) encoding. The alternative is to use a variable number of bits to encode the characters - some characters are 8-bit encoded, others are 16, 24 or even 32-bit encoded.
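
Both effects are easy to demonstrate; a small Python sketch of my own (the byte 0xC3 and the character '中' are arbitrary examples):

    # The same byte, 0xC3, decoded under three different codepages:
    b = b'\xc3'
    print(b.decode('cp1252'))    # 'Ã'  (Latin / Western European)
    print(b.decode('cp1253'))    # 'Γ'  (Greek)
    print(b.decode('koi8-r'))    # 'ц'  (Cyrillic)

    # Chinese needs more than one byte per character:
    print('中'.encode('big5'))    # b'\xa4\xa4'  (two bytes)
    print('中'.encode('gb2312'))  # b'\xd6\xd0'  (two bytes)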

Multiple encoding systems like these have some obvious limitations. Codepage dependency can cause erroneous results or data corruption: if the wrong codepage is applied to the data and processing (sorting, etc.) is then done, the result may be wrong. One has to remember that there are many codepages, not only for different scripts, but also for the same script. Examples of Cyrillic codepages are KOI8-R, KOI8-U and Windows-1251; common Chinese codepages are Big5 and GB. (For the Latin script, apart from the ANSI character set, encoding schemes like IBM's EBCDIC exist, making data transfer open to risks of corruption.) Another limitation resulting from codepages can be seen in a web browser: a web page can be set to only one codepage, making multilingual content display a challenge.
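
Here is a quick Python sketch of the kind of corruption I mean, using two of those Cyrillic codepages (the word 'Привет', "hello", is just an example):

    # Text encoded under one Cyrillic codepage but decoded under another
    # comes out as plausible-looking, yet completely wrong, Cyrillic:
    data = 'Привет'.encode('koi8-r')
    print(data.decode('koi8-r'))   # 'Привет' (correct)
    print(data.decode('cp1251'))   # 'рТЙЧЕФ' (silently corrupted)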

Unicode addresses the shortcomings of these systems. The encoding has a unique entry for each different character, and it covers a great many scripts, even extending to mathematical symbols. In essence, it looks like a single big "codepage" which can cover any character we need. In addition, Unicode has other features, like support for bidirectional text, which is needed for languages such as Arabic.
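
A Python sketch of my own of that "single big codepage" idea - characters from several scripts coexist in one program, each with its own unique number, and no codepage switching:

    # Latin, Cyrillic, Chinese and a mathematical symbol, side by side:
    for ch in ['é', 'ц', '中', '∑']:
        print(ch, 'U+%04X' % ord(ch))
    # é U+00E9
    # ц U+0446
    # 中 U+4E2D
    # ∑ U+2211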

2. What is UTF-8?

UTF stands for Unicode Transformation Format. UTF specifies the byte sequence of the character encoding. Unicode can be thought of as a big table with each entry numbered; characters are assigned positions in this big table. For example, ';' sits at position 59 and 't' at position 116. The correct way to specify a position (code point) is in the form U+xxxx, where xxxx is in hexadecimal. Code points range from U+0000 to U+10FFFF.
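
These positions are easy to inspect; a short Python sketch:

    print(hex(ord(';')))    # 0x3b -> U+003B, position 59
    print(hex(ord('t')))    # 0x74 -> U+0074, position 116
    print(chr(0x4E2D))      # '中' - the character at U+4E2D
    # chr(0x110000) raises ValueError: past U+10FFFF, the last code point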

UTF, on the other hand, specifies the bit patterns used to represent the code points. It is actually an algorithmic mapping from every Unicode scalar value to a unique byte sequence. There are a few transformation formats: UTF-8, UTF-16 and UTF-32.
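
A Python sketch showing the same two code points under all three formats ('é' and the musical G clef symbol are arbitrary examples; the second one lies outside the first 65536 positions, so UTF-16 needs a surrogate pair for it):

    c = 'é'                             # U+00E9
    print(c.encode('utf-8').hex())      # c3a9       (2 bytes)
    print(c.encode('utf-16-be').hex())  # 00e9       (2 bytes)
    print(c.encode('utf-32-be').hex())  # 000000e9   (4 bytes)

    g = '\U0001D11E'                    # U+1D11E, musical symbol G clef
    print(g.encode('utf-8').hex())      # f09d849e   (4 bytes)
    print(g.encode('utf-16-be').hex())  # d834dd1e   (a surrogate pair)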

3. Unicode and Windows systems

Windows NT and Windows CE systems are Unicode-enabled. All strings are stored and operated on as Unicode, with the option of converting to ANSI characters in between if necessary. This means that they can display all major languages without difficulty, provided the necessary fonts are present. With increasing globalisation, it is important that software can be localised to serve global customers. Localising software is quite a complex issue that involves not only translating text, but also text input methods, sorting order, date/time display and so on. Microsoft realises this, and even has a website dedicated to software localisation issues. It is interesting to note that some versions of Windows 2000 and Windows XP Professional are built so that the whole user interface can be localised simply by loading the appropriate resource pack.
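
To illustrate that Unicode/ANSI conversion, here is a small sketch using Python's Windows-only 'mbcs' codec, which wraps the system's ANSI codepage (the same conversion the Win32 functions WideCharToMultiByte and MultiByteToWideChar perform); it will only run on Windows:

    text = 'Grüße'                # held internally as Unicode
    ansi = text.encode('mbcs')    # Unicode -> bytes in the ANSI codepage
    back = ansi.decode('mbcs')    # ANSI bytes -> Unicode again
    print(back == text)           # True, as long as every character
                                  # exists in the local ANSI codepage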

4. Pocket PC Specifics

Pocket PC is Unicode-based, since it is built on top of Windows CE. Unfortunately, ActiveSync is the weakest link here. There is only one version of ActiveSync, which runs on both Windows NT-based and Windows 9x-based systems. Windows 9x systems are not Unicode-based, so ActiveSync itself is unable to handle Unicode. For example, files with non-Latin filenames cannot be transferred through ActiveSync; the correct codepage has to be set in order to display and transfer such files.

Issues regarding Windows Media Player for Pocket PC can be found here, while information about reading Chinese/Japanese on PPC can be found here.


