Unicode Howto for KDE developers

What's Unicode?

It's a character set that assigns a number to the characters and symbols of the world's writing systems. Most codes in everyday use fit into a 16-bit integer, although the standard defines codes up to U+10FFFF. The first 256 codes represent the same characters as Latin-1. For details see http://www.unicode.org/.

Why use Unicode?

If you use the 8-bit charset of your locale, you can create only single-language documents and exchange them only with people using the same charset. Unicode is the only sensible way to mix languages in a document and to exchange documents between people with different locales. For example, you cannot easily write a simple Russian-Hungarian dictionary without Unicode.

What do UTF-8 and UTF-16 mean?

UTF stands for Unicode Transformation Format. These formats define how Unicode characters are expressed in bits. In UTF-16, every character with a code below 0x10000 is represented by the 16-bit value of its Unicode number (higher codes take a pair of 16-bit values). For example, the Latin-1 characters in UTF-16 have the hex representation 00nn, where nn is the hex representation in Latin-1. In UTF-8, Unicode characters are represented by a stream of bytes. Bytes with values from 0 to 127 correspond to the ASCII characters. All other characters are represented by more than one byte, each with a value of 128 or above. Because the bytes for "\0" and "/" therefore cannot occur inside a multibyte character, UTF-8 strings can be treated as null-terminated C strings.
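To make the 00nn pattern concrete, here is a minimal sketch in plain C++ without Qt. The function name latin1ToUtf16BE is made up for this example, and it assumes the high byte is stored first (big-endian order) and that no Unicode mark is written.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: convert Latin-1 bytes to UTF-16 bytes, high byte first.
// Every Latin-1 byte nn becomes the two bytes 00 nn.
std::vector<unsigned char> latin1ToUtf16BE(const std::string &latin1)
{
    std::vector<unsigned char> out;
    out.reserve(latin1.size() * 2);
    for (std::string::size_type i = 0; i < latin1.size(); ++i) {
        out.push_back(0x00);                                  // high byte is always zero
        out.push_back(static_cast<unsigned char>(latin1[i])); // low byte is the Latin-1 code
    }
    assert(out.size() == latin1.size() * 2); // two bytes per character
    return out;
}
```

Note that every second byte of the result is 0, which is why such data cannot be handled as a C string.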

If you are curious, here's a simple scheme showing how the bits of Unicode character codes are put into UTF-8 bytes:

 bytes | bits | representation
     1 |    7 | 0vvvvvvv
     2 |   11 | 110vvvvv 10vvvvvv
     3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv
     4 |   21 | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
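The table can be turned into code almost line by line. The following is only a sketch in plain C++ without Qt. The name encodeUtf8 is invented for this example, and the function does not reject code values that are not valid characters.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encode one Unicode code value as UTF-8 bytes,
// following the bit layout of the table above.
std::string encodeUtf8(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL); // Unicode ends at U+10FFFF
    std::string out;
    if (cp < 0x80) {            // 7 bits:  0vvvvvvv
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {    // 11 bits: 110vvvvv 10vvvvvv
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 16 bits: 1110vvvv 10vvvvvv 10vvvvvv
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                    // 21 bits: 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, the code 0xE4 (a with diaeresis) needs 8 bits, so it falls into the second row and becomes the two bytes C3 A4.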

What's the Unicode marker?

Unicode files should be marked with the 16-bit value FEFF at the beginning. This mark also indicates the byte order of the file: if you read FFFE instead, the file was created on a machine with the opposite byte order and every 16-bit value must be byte-swapped.
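A check for the mark could look like the sketch below. It is plain C++ without Qt, and the names ByteOrder and detectByteOrder are made up for this example. It looks at the raw bytes, so FE FF means the high byte comes first.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for this example.
enum ByteOrder { NoMark, BigEndian, LittleEndian };

// Hypothetical helper: inspect the first two bytes of a buffer
// for the Unicode mark FEFF.
ByteOrder detectByteOrder(const unsigned char *data, std::size_t size)
{
    assert(data != 0 || size == 0);
    if (size < 2)
        return NoMark;
    if (data[0] == 0xFE && data[1] == 0xFF)
        return BigEndian;    // bytes FE FF: high byte first
    if (data[0] == 0xFF && data[1] == 0xFE)
        return LittleEndian; // bytes FF FE: low byte first
    return NoMark;
}
```

A file without the mark gives NoMark. In that case the application has to fall back to a default or ask the user.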

How to support Unicode in KDE applications?

Every KDE2 application dealing with text should support Unicode as a second choice after the locale character set. That means the user of your application should be able to read and save a document with the locale charset as the default and with Unicode as an option. Both UTF-8 and UTF-16 should be supported. I think European users would prefer UTF-8 and Asian users would prefer UTF-16.

Where can I get sample Unicode files?

Take the kde-i18n package and convert some files from there to Unicode, e.g.

recode KOI8-R..UTF-8      ru/messages/kdelibs.po
recode ISO-8859-1..UTF-16 de/messages/kdelibs.po
recode ISO-8859-3..UTF-8  eo/messages/kdelibs.po

Where can I get Unicode fonts?

Take xfsft from ftp://ftp.dcs.ed.ac.uk/pub/jec/programs/xfsft/. Take some TrueType fonts from your Windows installation, or look for them on http://www.microsoft.com, and install these fonts as usual.

Here is an excerpt from my fonts.scale file:

verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso10646-1
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso8859-1
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso8859-2
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso8859-3
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso8859-5
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-koi8-r
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-koi8-u       
...

How will the user put Unicode characters into a widget?

Warwick says: The correct solution for users wanting to input characters in the full Unicode range would be to use a utf8 locale, with a corresponding UTF8-based input method.

That does not work very well for KDE at the moment. I hope some of the problems will be solved before the KDE 2.0 release. An interim solution for entering some special characters is kdeutils/kcharselect, but it is not suitable for longer texts.

How to implement Unicode in my KDE application?

As long as you use the Qt 2.0 classes or the KDE classes based on them, it's very simple to support Unicode using the text codecs of Qt. See the code snippets below for some frequently needed tasks.

What about QString and QCString?

The class QString internally represents every character by its 16-bit value. If you read text from or write it to an 8-bit source such as a QTextStream or a QCString, you have to use a QTextCodec to convert the text from or to its 16-bit representation. You cannot hold a UTF-16 encoded string in a QCString, because QCString takes a null byte as the string terminator, and a UTF-16 string normally contains null bytes (e.g. every Latin-1 equivalent consists of a null byte and the Latin-1 code). Normally you would use the UTF-8 codec to convert a Unicode string between QCString and QString.
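The null byte problem is easy to demonstrate without Qt. The sketch below uses only the C library. The names utf16Hi, utf8Hi and cStringLength are made up for this example, and the UTF-16 data is assumed to be stored high byte first without a Unicode mark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "Hi" as UTF-16 (high byte first, no mark) and as UTF-8,
// each followed by a terminating zero. Hypothetical sample data.
static const char utf16Hi[] = { 0x00, 'H', 0x00, 'i', 0x00, 0x00 };
static const char utf8Hi[]  = { 'H', 'i', 0x00 };

// Hypothetical helper: the length a null-terminated string class sees.
std::size_t cStringLength(const char *bytes)
{
    assert(bytes != 0);
    return std::strlen(bytes); // stops at the first 0 byte
}
```

With this data, cStringLength(utf16Hi) is 0 because the very first byte is a null byte, while cStringLength(utf8Hi) is 2.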

How to get a codec for Unicode?

// codecForName() returns a pointer; it is 0 if no matching codec is found
QTextCodec *utf8_codec  = QTextCodec::codecForName("utf-8");
QTextCodec *utf16_codec = QTextCodec::codecForName("utf-16");

How to read Unicode from a text file?

QTextStream textStream;
QString line;

// UTF-16, if the file begins with a Unicode mark
// the default is ok, otherwise:
// textStream.setEncoding(QTextStream::Unicode);
line = textStream.readLine();
...

// UTF-8
textStream.setCodec(utf8_codec);
line = textStream.readLine();
...

How to write Unicode to a text file?

QTextStream textStream;
QString line;

// UTF-16
textStream.setEncoding(QTextStream::Unicode);
textStream << line;
...

// UTF-8
textStream.setCodec(utf8_codec);
textStream << line;
...
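If you are curious what ends up in a UTF-16 file, the following sketch builds the bytes by hand in plain C++ without Qt. The name toUtf16BEBytes is made up for this example. It assumes the high byte is written first and that the text is already available as 16-bit values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: serialize 16-bit character values as UTF-16 bytes,
// high byte first, preceded by the Unicode mark FEFF.
std::string toUtf16BEBytes(const std::vector<unsigned short> &units)
{
    std::string bytes;
    bytes += static_cast<char>(0xFE); // Unicode mark, high byte
    bytes += static_cast<char>(0xFF); // Unicode mark, low byte
    for (std::size_t i = 0; i < units.size(); ++i) {
        bytes += static_cast<char>(units[i] >> 8);   // high byte
        bytes += static_cast<char>(units[i] & 0xFF); // low byte
    }
    assert(bytes.size() == 2 + 2 * units.size());
    return bytes;
}
```

The result may contain 0 bytes, so it has to be written with its explicit size to a file opened in binary mode.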

How to read Unicode from a buffer?

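char * buffer; // raw bytes from somewhere
int length;    // number of bytes in buffer

// UTF-16: pass the length, because the data contains 0 bytes
QString string = utf16_codec->toUnicode(buffer, length);

// UTF-8: the buffer can be treated as a null-terminated C string
QString string = utf8_codec->toUnicode(buffer);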
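What such a conversion has to do for UTF-16 can be sketched in plain C++ without Qt. This is not how the Qt codec is implemented. The name decodeUtf16BE is made up for this example, and the data is assumed to be stored high byte first without a Unicode mark. The point is that the length is passed explicitly, because 0 bytes are ordinary data here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: decode UTF-16 bytes (high byte first) into 16-bit values.
// The length is given in bytes; the buffer is not treated as a C string.
std::vector<unsigned short> decodeUtf16BE(const char *buffer, std::size_t length)
{
    assert(length % 2 == 0); // UTF-16 data always comes in byte pairs
    std::vector<unsigned short> units;
    for (std::size_t i = 0; i + 1 < length; i += 2) {
        const unsigned hi = static_cast<unsigned char>(buffer[i]);
        const unsigned lo = static_cast<unsigned char>(buffer[i + 1]);
        units.push_back(static_cast<unsigned short>((hi << 8) | lo));
    }
    return units;
}
```

For the four bytes 00 41 04 16 this gives the two values 0x0041 and 0x0416.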

How to write Unicode to a buffer?

You should be very careful with raw Unicode data in a buffer, because a single 0 byte doesn't end the string. The best way to deal with Unicode data is to use the classes QString and QChar.

// UTF-16
QString string; // my Unicode text
QByteArray array;
QTextStream textStream(array,IO_WriteOnly);

textStream.setEncoding(QTextStream::Unicode);
textStream << string;

// UTF-8
QString string; // my Unicode text
QCString utf8 = string.utf8();    // keeps the converted bytes alive
const char* buffer = utf8.data(); // valid only as long as utf8 exists

Wolfram Diestel <wolfram@steloj.de>