Unicode Howto for KDE developers
What's Unicode?
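It is a character set that assigns a number to every character
and symbol of the world's writing systems. Most characters in
current use have codes in the range of 16-bit integers, although
the standard reserves codes up to hex 10FFFF. The first 256 codes
represent the same characters as in Latin-1.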
For details see
http://www.unicode.org/.
Why use Unicode?
If you use the locale 8bit charset you can
create only one-language documents and interchange
them only with people using the same charset.
Unicode is the only sensible way to
mix languages in a document and to
interchange documents between people with
different locales.
For example, you cannot easily write a simple
Russian-Hungarian dictionary
without using Unicode.
What do UTF-8 and UTF-16 mean?
UTF means Unicode Transformation Format. These formats
define how to express Unicode characters
in bits.
UTF-16 means that every character in the 16-bit range
is represented by the 16-bit value of its
Unicode number. Characters beyond that range are
written as a pair of 16-bit values (a surrogate pair).
For example, the Latin-1 characters in UTF-16
have a hex representation of 00nn, where nn
is the hex representation in Latin-1.
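The 00nn rule for Latin-1 can be sketched in plain C++ without Qt. The function name latin1ToUtf16 is made up for this illustration, and the sketch only covers Latin-1 input, not general charset conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Convert a Latin-1 byte string to UTF-16 code units.
// Each Latin-1 byte nn becomes the 16-bit value 0x00nn.
// (latin1ToUtf16 is a hypothetical helper, not a Qt or KDE function.)
std::vector<unsigned short> latin1ToUtf16(const std::string &latin1)
{
    std::vector<unsigned short> units;
    units.reserve(latin1.size());
    for (std::size_t i = 0; i < latin1.size(); ++i) {
        // Go through unsigned char so bytes above 0x7F are not sign-extended.
        units.push_back(static_cast<unsigned char>(latin1[i]));
    }
    return units;
}
```

In a Qt program you would normally let QString and a QTextCodec do this conversion for you.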
UTF-8 means that the Unicode characters
are represented by a stream of bytes.
The bytes with values from 0 to 127
correspond to the ASCII characters.
Other characters are represented by more
than one byte.
Because the bytes "\0" and "/"
cannot occur inside a multibyte character
in UTF-8, you can treat its strings
as null-terminated C strings.
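As a small illustration of this property, the following sketch walks a null-terminated UTF-8 string with ordinary C string techniques. It relies on the fact that the second and later bytes of a multibyte character always have their top two bits set to 10. The name utf8CharCount is an assumption made for this example, and the code does not validate its input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Count the characters in a null-terminated UTF-8 string.
// Every byte that is not a continuation byte (bit pattern 10vvvvvv)
// starts a new character. (utf8CharCount is a hypothetical helper.)
std::size_t utf8CharCount(const char *s)
{
    std::size_t count = 0;
    for (; *s != '\0'; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

Note that strlen() still works on such a string, but it returns the number of bytes, which is larger than the number of characters as soon as non-ASCII characters occur.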
If you are curious, here's a simple scheme showing how
the bits of Unicode character codes are put into UTF-8 bytes:
bytes | bits | representation
    1 |    7 | 0vvvvvvv
    2 |   11 | 110vvvvv 10vvvvvv
    3 |   16 | 1110vvvv 10vvvvvv 10vvvvvv
    4 |   21 | 11110vvv 10vvvvvv 10vvvvvv 10vvvvvv
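The table can be turned into code almost line by line. The following is a minimal sketch in plain C++. The function name encodeUtf8 is an assumption for this example, and the sketch does not reject values that are not valid characters (such as surrogate values).

```cpp
#include <cassert>
#include <string>

// Encode one Unicode character code as UTF-8, following the table above.
// Returns an empty string for codes that need more than 21 bits.
// (encodeUtf8 is a hypothetical helper, not a Qt or KDE function.)
std::string encodeUtf8(unsigned long code)
{
    std::string out;
    if (code < 0x80) {                 // 7 bits:  0vvvvvvv
        out += static_cast<char>(code);
    } else if (code < 0x800) {         // 11 bits: 110vvvvv 10vvvvvv
        out += static_cast<char>(0xC0 | (code >> 6));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else if (code < 0x10000) {       // 16 bits: 1110vvvv 10vvvvvv 10vvvvvv
        out += static_cast<char>(0xE0 | (code >> 12));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    } else if (code < 0x200000) {      // 21 bits: 11110vvv and three 10vvvvvv
        out += static_cast<char>(0xF0 | (code >> 18));
        out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (code & 0x3F));
    }
    return out;
}
```

For example, the code hex E9 (e with acute accent) has more than 7 bits, so it takes the two-byte form and becomes the bytes C3 A9.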
What's the Unicode marker?
Unicode (UTF-16) files should begin with the
16-bit value FEFF, the byte order mark. This mark
also indicates the byte order of the file:
if you read FFFE instead, it is a Unicode
file created on a machine with a different
byte order.
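A check for the mark can be sketched as follows. The enum and function names are made up for this illustration. The sketch looks at the raw bytes, so FE FF means the high byte comes first (big-endian) and FF FE means the low byte comes first (little-endian).

```cpp
#include <cassert>
#include <cstddef>

// Result of looking for the Unicode mark at the start of a buffer.
// (ByteOrder and detectByteOrder are hypothetical names.)
enum ByteOrder { NoMark, BigEndian, LittleEndian };

// Inspect the first two bytes of a buffer for the mark FEFF.
ByteOrder detectByteOrder(const unsigned char *data, std::size_t size)
{
    if (size < 2)
        return NoMark;
    if (data[0] == 0xFE && data[1] == 0xFF)
        return BigEndian;      // bytes FE FF: high byte first
    if (data[0] == 0xFF && data[1] == 0xFE)
        return LittleEndian;   // bytes FF FE: low byte first
    return NoMark;
}
```

A file without the mark is not necessarily free of Unicode, so a NoMark result should fall back to the locale charset or to a user setting.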
How to support Unicode in KDE applications?
Every KDE2 application dealing with text should
support Unicode as a second choice after the
locale character set. That means, the
user of your application should be able to read
and save a document with the locale charset as
the default and with Unicode as an option.
Both UTF-8 and UTF-16 should be supported.
I think European users would prefer UTF-8 and
Asian users would prefer UTF-16.
Where can I get sample Unicode files?
Take the kde-i18n package and convert some
files from there to Unicode, e.g.
recode KOI8-R..UTF-8 ru/messages/kdelibs.po
recode ISO-8859-1..UTF-16 de/messages/kdelibs.po
recode ISO-8859-3..UTF-8 eo/messages/kdelibs.po
Where can I get Unicode fonts?
Take xfsft from
ftp://ftp.dcs.ed.ac.uk/pub/jec/programs/xfsft/.
Take some Truetype fonts from your Windows installation
or look for them on
http://www.microsoft.com
and install these fonts as usual.
Here is part of my fonts.scale file:
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso10646-1
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso8859-1
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso8859-2
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso8859-3
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-iso8859-5
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-koi8-r
verdana.ttf -microsoft-verdana-medium-r-normal--0-0-0-0-p-0-koi8-u
...
How will the user put Unicode characters into a widget?
Warwick says: The correct solution for users wanting to input
characters in the full Unicode range would be to use a utf8
locale, with a corresponding UTF8-based input method.
That does not work very well in KDE at the moment. I hope
some of the problems will be solved before the KDE 2.0 release.
An interim solution to input some special characters
is kdeutils/kcharselect. But this is not suitable
for longer texts.
How to implement Unicode in my KDE application?
As long as you use the Qt 2.0 classes or the
KDE classes based on them, it's very simple to
support Unicode using the text codecs of Qt.
See the code snippets below for some
frequently needed tasks.
What about QString and QCString?
The class QString internally represents every
character by its 16-bit value. If you read
text from or write it to an 8-bit source such as a
QTextStream or QCString, you have to use a QTextCodec
to convert the text from or to its 16-bit representation.
You cannot hold a UTF-16 encoded string in
a QCString, because it treats a null byte as
the string terminator, and there are normally
null bytes in a UTF-16 string (e.g. all Latin-1
equivalents consist of a null byte and the Latin-1
code).
Normally you would use the UTF-8 codec to
convert a Unicode string between QCString
and QString.
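The null-byte problem can be seen without Qt by looking at the raw bytes. In this sketch the arrays and the function name fitsInCString are assumptions made for the illustration: the same two-letter text is stored once as UTF-16 (high byte first) and once as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "Hi" as UTF-16 bytes, high byte first: 00 48 00 69.
static const char utf16Hi[] = { 0x00, 0x48, 0x00, 0x69 };
// "Hi" as UTF-8 bytes: 48 69.
static const char utf8Hi[] = { 0x48, 0x69 };

// True if none of the `size` bytes is a null byte, i.e. if a
// null-terminated string class would keep the whole text.
// (fitsInCString is a hypothetical helper, not a Qt function.)
bool fitsInCString(const char *bytes, std::size_t size)
{
    return std::memchr(bytes, '\0', size) == 0;
}
```

Here fitsInCString(utf16Hi, sizeof utf16Hi) is false, because the very first byte is null, while the UTF-8 form passes the check.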
How to get a codec for Unicode?
// codecForName() returns a pointer (0 if no matching codec is found)
QTextCodec *utf8_codec = QTextCodec::codecForName("utf-8");
QTextCodec *utf16_codec = QTextCodec::codecForName("utf-16");
How to read Unicode from a text file?
QTextStream textStream(&file); // file: a QFile opened with IO_ReadOnly
QString line;
// UTF-16: if the file begins with a Unicode mark,
// the default is ok; otherwise:
// textStream.setEncoding(QTextStream::Unicode);
line = textStream.readLine();
...
// UTF-8
textStream.setCodec(utf8_codec);
line = textStream.readLine();
...
How to write Unicode to a text file?
QTextStream textStream(&file); // file: a QFile opened with IO_WriteOnly
QString line; // the text to write
// UTF-16
textStream.setEncoding(QTextStream::Unicode);
textStream << line;
...
// UTF-8
textStream.setCodec(utf8_codec);
textStream << line;
...
How to read Unicode from a buffer?
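char * buffer; // the raw bytes
int length;    // number of bytes in buffer
// UTF-16: pass the length, because the data may contain null bytes
QString string = utf16_codec->toUnicode(buffer, length);
// UTF-8
QString string = utf8_codec->toUnicode(buffer);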
How to write Unicode to a buffer?
You should be very careful with raw Unicode data
in a buffer, because a single 0 byte doesn't end
the string. The best way to deal with Unicode data
is to use the classes QString and QChar.
// UTF-16
QString string; // my Unicode text
QByteArray array;
QTextStream textStream(array,IO_WriteOnly);
textStream.setEncoding(QTextStream::Unicode);
textStream << string;
// UTF-8
QString string; // my Unicode text
// keep the QCString alive as long as buffer is in use
QCString utf8 = string.utf8();
const char* buffer = utf8.data();
Wolfram Diestel <wolfram@steloj.de>