Multibyte functions are useful in Far East countries, such as China (including Taiwan and Hong Kong), Korea, and Japan. The code pages of these countries are generally compatible with ASCII for character codes below 0x80, while the meaning of codes at 0x80 or above (and perhaps of the byte that immediately follows) varies from code page to code page. For example, the name of the current Chinese Premier, Mr Zhu Rongji, is encoded in code page 936 (GBK) as 0xD6EC 0xE946 0xBBF9. Please note that the second character, Rong, is encoded as 0xE946, whose second byte is lower than 0x80 (however, the most common characters in CP 936 have both bytes greater than 0x80).
To use the multibyte functions, the first step is to set the code page, which is done with the _setmbcp function. Of course, if you are using the OS default code page, this step can be omitted.
It is trivial to mention that the _getmbcp function is used to get the code page:

#include <locale.h>
#include <mbctype.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    setlocale(LC_ALL, "Chinese_Taiwan");
    _setmbcp(_MB_CP_LOCALE);        /* use the code page of the current locale */
    printf("%d\n", _getmbcp());     /* print the code page now in effect */
    return 0;
}
Of the byte classification routines, I only want to explain _ismbblead and _ismbbtrail. The former tests whether an integer is the first (lead) byte of a multibyte character, and the latter tests whether an integer is the second (trail) byte of a multibyte character. The functions _ismbslead and _ismbstrail (they have no inline versions) are also interesting in that they test whether a given byte in a string is a lead byte or a trail byte depending on context. For example,
#include <mbctype.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    unsigned char s[] = "\326\354"; /* ZHU1 */
    _setmbcp(936); /* Chinese GBK */
    printf("%d\n", _ismbblead(s[1]));       /* context-free test of the byte */
    printf("%d\n", _ismbslead(s, &s[1]));   /* test in the context of s */
    return 0;
}

The output is 4 (or 1 if you use the function version of _ismbblead) and 0, which shows that 0354 (0xEC) can be the lead byte in a Chinese string, but here it is not the lead byte.
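To make the difference between the two kinds of test concrete without depending on the Microsoft runtime, here is a minimal portable sketch. It is not how the CRT implements these functions. The names gbk_may_lead and gbk_is_lead_at are hypothetical, and the sketch assumes that any byte in the range 0x81 to 0xFE may start a two-byte GBK character.

```c
#include <assert.h>
#include <stddef.h>

/* Context-free test. Assumption: a byte in 0x81..0xFE may start a
   two-byte GBK character. This is a simplification for illustration. */
int gbk_may_lead(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Context-dependent test (hypothetical helper): returns 1 only if s[pos]
   really starts a two-byte character, judged by walking the string from
   its beginning and stepping over whole characters. */
int gbk_is_lead_at(const unsigned char *s, size_t pos)
{
    size_t i = 0;

    assert(s != NULL);
    while (s[i] != '\0' && i < pos) {
        /* Step over two bytes for a complete double-byte character,
           otherwise over a single byte. */
        i += (gbk_may_lead(s[i]) && s[i + 1] != '\0') ? 2 : 1;
    }
    /* pos must fall on a character boundary and hold a lead byte
       that is followed by a trail byte. */
    return i == pos && gbk_may_lead(s[pos]) && s[pos + 1] != '\0';
}
```

With the string "\326\354", gbk_may_lead(0xEC) is true, yet gbk_is_lead_at reports that position 1 is not a lead byte, because the scan has already consumed it as the trail byte of the first character.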
More interesting multibyte string functions are defined in mbstring.h. They include equivalents to standard string routines (taking into account the properties of multibyte characters), character type routines, and conversion routines. We shall name one from each category for explanation.
#include <mbctype.h>
#include <mbstring.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char s[] = "\326\354"; /* ZHU1 */
    _setmbcp(936); /* Chinese GBK */
    printf("%d\n", (int)strlen(s));                         /* number of bytes */
    printf("%d\n", (int)_mbslen((const unsigned char *)s)); /* number of characters */
    return 0;
}

The output is 2 and 1. Another function, _mbstrlen (declared in stdlib.h), uses the locale (instead of the code page) information to count the number of multibyte characters.
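The idea behind counting characters rather than bytes can be sketched in portable C as follows. This is only an illustration under the assumption that lead bytes lie in 0x81 to 0xFE, not the CRT's actual implementation, and gbk_count_chars is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts characters in a GBK-encoded string.
   Assumption: a byte in 0x81..0xFE followed by another byte forms one
   two-byte character; every other byte is a character by itself. */
size_t gbk_count_chars(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    assert(s != NULL);
    while (*p != '\0') {
        if (*p >= 0x81 && *p <= 0xFE && p[1] != '\0')
            p += 2;     /* lead byte plus trail byte */
        else
            p += 1;     /* single-byte character */
        count++;
    }
    return count;
}
```

For the six bytes 0xD6EC 0xE946 0xBBF9 mentioned at the beginning, written in C as "\326\354\351\106\273\371", the function counts three characters. The trail byte 0x46 is below 0x80, which is why the scan has to track lead bytes instead of simply looking for bytes above 0x80.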
/* mbstowcs.c - Demonstration of the mbstowcs function */

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main()
{
    wchar_t wcstr[64];
    char mbstr[] = "ABC - A Unicode conversion test";
    size_t n, i;

    // Set the locale (English here; the original test used the
    // code page of Simplified Chinese)
    //
    // NB: mbstowcs is affected by locale information. By the way, the
    // MBSTOWCS example given in MSDN makes no sense at all.
    //
    setlocale(LC_ALL, "English");

    // Convert the string
    n = mbstowcs(wcstr, mbstr, _mbstrlen(mbstr));

    // Output the Unicode string to the screen
    for( i = 0; i < n; i++ )
        printf( "%.4X\t", (unsigned)wcstr[i] );

    return 0;
}

The output is

0041 0042 0043 0020 002D 0020 0041 0020 0055 006E
0069 0063 006F 0064 0065 0020 0063 006F 006E 0076
0065 0072 0073 0069 006F 006E 0020 0074 0065 0073
0074

It seems logical, but not very interesting, right? I originally used Chinese characters, but Chinese characters cannot be displayed in an ISO-8859-1 page :-(. However, if you are using a Far East-capable version of Windows (I confirmed that English Windows 95 will not do), you may try giving a suitable value to mbstr, and setting the appropriate locale. The result will be more interesting.
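When the length of the input is not known in advance, a fixed 64-element buffer may be too small. The following is a minimal sketch of one way to size the buffer first. The helper name mb_to_wide is hypothetical, and the sketch relies on the behaviour, documented for both the Microsoft CRT and POSIX, that mbstowcs returns the required length when the destination is NULL. The caller is assumed to have set a suitable locale beforehand.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: converts mbstr to a newly allocated wide string
   according to the current locale. Returns NULL if the input is not
   valid in that locale or if memory runs out. The caller frees the
   result. If out_len is not NULL, it receives the number of wide
   characters, excluding the terminating null. */
wchar_t *mb_to_wide(const char *mbstr, size_t *out_len)
{
    size_t n;
    wchar_t *w;

    assert(mbstr != NULL);

    /* Assumption: a NULL destination makes mbstowcs report the length. */
    n = mbstowcs(NULL, mbstr, 0);
    if (n == (size_t)-1)
        return NULL;                    /* invalid multibyte sequence */

    w = malloc((n + 1) * sizeof *w);    /* room for the terminator */
    if (w == NULL)
        return NULL;

    mbstowcs(w, mbstr, n + 1);          /* also writes the terminator */
    if (out_len != NULL)
        *out_len = n;
    return w;
}
```

Checking for (size_t)-1 matters with real multibyte input, because a byte sequence that is valid in one code page may be rejected under another locale.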
OK. I am a little tired now. Enough for today.
2001-12-16, written by Wu Yongwei