Character Encoding... A few words on the subject

Character Encoding... A few words on the subject

article code

Contents
• What is Unicode? • XML and character encoding • XSLT and the Output • Other tools, how they handle this? • Objects/Interfaces encoding • Final Comment • Resources

What is Unicode?

I've seen lots of confusion, and even the documentation sometimes brings a misleading belief of what in fact is Unicode.

Unicode is a worldwide character-encoding standard, published by the Unicode Consortium. Computers store numbers that represent a character, Unicode provides a unique number for every character. Unicode is compatible with the ISO/IEC 10646.

Now come the need to encode such characters and that made available several ways of doing just that, there are several ways of representing the same characters

So what the heck mean UCS-* and UTF-*

UCS is just the acronym of Universal Character Set specified by the ISO/IEC 10646. The number that accompanies this acronym indicates the number of octets that the character set is coded. UCS-2 (two octets) and UCS-4 (four octets)

UTF is the acronym for Unicode Transformation Format

In order to be compatible with older systems that didn't support Unicode, the Unicode Consortium, defined what they call Encoding Forms, that is the representation of the character in bits, the number that accompanies this acronym indicates the encoding form that is to be used. So UTF-8 (8-bit encoding form), UTF-16 (16-bit encoding form). Be aware that this doesn't mean that in the case of the UTF-8 it will always use only 8 bits.

UTF use bit-shifting techniques to encode Unicode characters, UTF-8 encodes each Unicode character as variable number of bytes (usually 3), a particularity of UTF-8 is that character below the 127 range in ASCII have the same byte values in ASCII and UTF-8. UTF-16 always encodes each Unicode character as two bytes.

Little-Endian and Big-Endian

Some microprocessors differ in the use of the least significant byte, So you have a different BOM (see below) to indicate this byte order. Sometimes you will see the UTF-16LE and UTF-16BE denomination to indicate this byte order.

What is a BOM?

BOM (Byte Order Mark), this is a encoding signature for the file. A particular sequence of bytes at the beginning of the file that indicates the encoding and the byte order.

UTF-8 - EF BB BF - ï»ż

UTF-16LE - FF FE - ÿþ

UTF-16BE - FE FF - þÿ

This encoding signature is not to be displayed, any tool that support Unicode will understand this and will not show this to you nor consider it to be part of the text file. So you also have a problem that is how can I find out what BOM is being used. Well checking the Hexadecimal of the file or opening the file in a non-unicode text editor will give you those characters presented in the above table.

How are character represented

It’s important to understand how characters are encoded, all my samples use the character í. These characters along with other accentuated characters are very common on my natural language. I’ve opted for this one because it’s easy for me to write it, and it’s above the 127 range in ASCII, making him two bytes in UTF-8 (this will be important to show up some problems).

I've made this simple table to try and explain all this, hopefully you will understand it a little bit better…

Chr	dec	hex	UCS dec	UCS hex	UTF8 dec	UTF8 hex	one by one	Unicode Name
í	146	92	237	00ED	195.173	C3AD	Ã-	Latin Small Letter I with Acute

For more information take a look at the following links.
http://www.unicode.org/unicode/standard/WhatIsUnicode.html
http://www.unicode.org/unicode/standard/principles.html

XML and character encoding

The W3C specify that all XML processor must read the UTF-8 and UTF-16 encoding. This text explain that to differentiate between UTF-8 and UTF-16 a BOM must be present, and that the BOM must be used by the parser as encoding signature. Other encoding may be supported, but no parser is required to have support for all of them, or one in particular, besides the UTF-8 and UTF-16. Well if this text is not enough to convince you that you should be using Unicode for all your XML files I don't known what else can I say. Well you should really give a good enough though every time you think using anything else besides Unicode (UTF-8 or UTF-16).

With IE and MSXML, there are two really common errors that happen when something is not correct with the three steps defined below.

An invalid character was found in text content (see my sample file error1.xml)

A parser found a character on your file that is not according the encoding declaration or the BOM specified for that file.

In the sample I've the character encoded with ISO-8859-1 and then speficy the UTF-8 on the encoding declaration, this will issue the error.

The reason that it doesn't issue the error below is that since the character is encoded with a one byte long encoding (ISO-8859-1), the parser at the beginning find this a valid character, finding later that can't understand the character that was used.

Switch from current encoding to specified encoding not supported

In a basic thinking this error is almost identical to the previous, the only thing is that the parser understands that the real encoding on the file is diferent from the one in the encoding declaration.

What the error is trying to tell you is that it can’t make the switch from the file encoding to the one you specify on the encoding declaration.

In error2.xml I've a file encoded as UTF-8 and a text encoding of UTF-16, since UTF-16 must always be two bytes, the parser known forehand that something is wrong with the encoding, the error3.xml is the same problem the other way around.

Declarations

XML and Text Declaration


[23] XMLDecl ::= ’<?xml’ VersionInfo [p.12] EncodingDecl
                 [p.38] ? SDDecl [p.15] ? S [p.8] ? ’?>’
[77] TextDecl ::= ’<?xml’ VersionInfo [p.12] ? EncodingDecl
                  [p.38] S [p.8] ? ’?>’

sample: <?xml version="1.0" encoding ='UTF-8'?>

The encoding declaration is part of the text declaration, in this form:


[80] EncodingDecl ::= S [p.8] ’encoding’ Eq
                      [p.12] (’"’ EncName [p.38]
                      ’"’ | "’" EncName [p.38]
                      "’" )

sample: encoding ='UTF-8'

Auto-Detecting the encoding

The W3C defined a non-normative (Appendix F).

There are 3 steps rules to check this.

If there is a BOM, then this defines the file encoding.
If there isn't then look at the Encoding Declaration.
If none of the above meet then assume UTF-8.

Since all XML files must start with a text declaration and that start with <? You can look at those characters to try and find out which encoding is used.

So what about character entities?

Character entity permits you to write a character without actually having to write the character. The drawback is that you don't want to actually write a character entity for each of those accentuated characters, this maybe nice feature for a once-in-time character like the trademark sign, but not for writing actual language.

You can specify your entities in three ways:

By name í
By decimal number í
By hexadecimal number í

See the sample file entities.xml

For more information take a look at the following links.
http://www.w3.org/TR/REC-xml#charencoding
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing

XSLT and the Output

Since XSLT is used to transform XML into other files formats (including XML), it's important to be able to specify the encoding that the developer of the XSLT wants on the destination.

The W3Cdefines a <xsl:output> element for this propose, and a encoding attribute to specify the pretended encoding of the output.


<!-- Category: top-level-element -->
     <xsl:output
          method = "xml" | "html" | "text" | qname-but-not-ncname 
          version = nmtoken
          encoding = string
          omit-xml-declaration = "yes" | "no"
          standalone = "yes" | "no"
          doctype-public = string
          doctype-system = string
          cdata-section-elements = qnames
          indent = "yes" | "no"
          media-type = string
     />

A XSLT Processor may perform the serialization of the result tree, but it's not required to. In the situation where the XSLT Processor does make the serialization of the result tree, it should, but it's not required to, to respect the definition in the <xsl:output> element.

For more information take a look at the following links.
http://www.w3.org/TR/1999/REC-xslt-19991116#output

Other tools, how they handle this?

The Unicode Consortium defines a set of rules for a Unicode enabled application, every application that claim to be unicode compliant should observe this set of rules.

The problem is that exist a large number of tools, applications and operating systems that still aren't Unicode compliant, so whenever you are trying to handle unicode files, certify yourself that you are using a unicode compliant application.

One of my first problems concerning Unicode and XML was because I was using a text editor that didn't support Unicode. One of these editors is the windows 95/98 notepad, this tool doesn't support Unicode so it will always write the file as ASCII based on the codepage that you have on selected on your Windows environment. The notepad that ships with Windows 2000 already has support for Unicode, you can define the encoding choosing one of the options in the encoding combo box, options are (ANSI, Unicode, Unicode Big-Endian and UTF-8).

Beware of the interpretation of MS of the word Unicode, almost every time that MS refers to Unicode they mean UTF-16.

This is referring to the part of saving files, when you are opening them this too can bring problems. Windows 2000 Notepad will read the BOM, and if present will respect it, but if no BOM is present, then we will scan the file trying to guest is encoding. If it decide this is an UTF-8 encoded, when you save it will put a BOM in the file.

Even if the tool support unicode, it can have different implementation, on the grey areas, for instance the editor I'm using, if no BOM is present and you open a UTF-8 encoded file, it will not scan the file to see if it's UTF-8 or not, like notepad 2000 does. So if during the edition you write a character like the "í" it will encode that according Windows codepage in a single byte encoding form.

So it's important that you check out your editor specification how it handles unicode, you can save yourself a lot of problem if you understand how it does it.

Objects/Interfaces encoding

Another frequent problem with encoding, are the objects/interfaces, and the way they handle character encoding. Well we could extend this article to more objects/interfaces, but since that would take forever. I recommend you to read careful the documentation of those and check out how they handle this situation, and even so, you will find that sometimes documentation is wrong, but that is maybe just a small part of it, even so you really must try it yourself.

MSXML 3sp1

DOMDocument

Load/ LoadXml

Those two methods serve to load a xml into the DOMDocument.

Carefull that the MSXML 3 SDK documentation, mention there that the LoadXml, only handles a UCS-2 or UTF-16 encoded string, but if you look at the KB Article Q275883 http://support.microsoft.com/support/kb/articles/Q275/8/83.ASP you will noticed that is says that this restriction is removes starting on MSXML 3.0.

Sub DOMLoadXml()
Dim docOut As MSXML2.DOMDocument
Dim stm As ADODB.stream
Dim strPath As String
Dim bolRet As Boolean
strPath = GetPath(CurrentDb.Name)
Set stm = New ADODB.stream
stm.Open
stm.Charset = "UTF-8"
stm.LoadFromFile strPath & "encUTF8.xml"
'Juts to check it out
MsgBox stm.Charset
stm.Position = 0
Set docOut = New MSXML2.DOMDocument
bolRet = docOut.loadXML(stm.ReadText)
If Not bolRet Then
MsgBox "The xml string was not loaded"
End If
'Just making sure
On Error Resume Next
Kill strPath & "test.xml"
docOut.Save strPath & "test.xml"
End Sub

If you check the test.xml file you will see that the character "í" is encoded as 0xC3AD, meaning that it's been read from UTF-8 and saved in UTF-8 too.

Save

The save method bases the decision of the character encoding in the encoding declaration. I've seen that in the previous snippet the file was saved in UTF-8, let see if it does is job with other encoding. If no encoding declaration is found then it will be encoded as UTF-8.

Private Sub DOMSave()
Dim docOut As MSXML2.DOMDocument
Dim domPI As MSXML2.IXMLDOMProcessingInstruction
Dim eleRoot As MSXML2.IXMLDOMElement
Dim eleChild As MSXML2.IXMLDOMElement
Dim strPath As String
Set docOut = New MSXML2.DOMDocument
Set domPI = docOut.createProcessingInstruction("xml", "version='1.0' encoding='UTF-16'")
docOut.appendChild domPI
Set eleRoot = docOut.createElement("Character")
docOut.appendChild eleRoot
Set eleChild = eleRoot.appendChild(docOut.createElement("chr"))
eleChild.setAttribute "ShipAddress", "í"
strPath = GetPath(CurrentDb.Name)
On Error Resume Next
Kill strPath & "test.xml"
docOut.Save strPath & "test.xml"
End Sub

If you check the test.xml file you will see that the file as a BOM (0xFFFE) and that the character "í" is encoded as 0xED00, meaning that it's been saved as UTF-16.

xml

The xml read-only property retrieves the node and all is child’s.

The part that concerns this article, this property will remove the encoding declaration and any BOM, of any file that is not UTF-16 encoded, this is handy to pass to the LoadXml.

Private Sub DOMxml()
Dim docIn As MSXML2.DOMDocument
Dim strPath As String
Dim lpBuffer() As Byte
strPath = GetPath(CurrentDb.Name)
Set docIn = New MSXML2.DOMDocument
docIn.async = False
docIn.Load strPath & "encUTF8.xml"
MsgBox docIn.xml
lpBuffer = docIn.xml
Debug.Print "0x" & Hex(lpBuffer(0)) & Hex(lpBuffer(1)) & Hex(lpBuffer(2)) & Hex(lpBuffer(3))
End Sub

Looking at the first 4 bytes of this string show us that indeed it’s encoding was changed to UTF-16.

transformNode

The transformNode Method takes a node and is children’s and transforms them according to the supplied XSLT stylesheet given.

This will always return a string encoded as UTF-16 and will add the Encoding declaration with UTF-16 too, that way we doesn’t observe what was specified in the xsl:output element of the XSLT stylesheet.

Sub DOMTransformNode()
Dim docXML As MSXML2.DOMDocument
Dim docXSL As MSXML2.DOMDocument
Dim strPath As String
strPath = GetPath(CurrentDb.Name)
Set docXML = New MSXML2.DOMDocument
docXML.async = False
docXML.Load strPath & "encUTF8.xml"
Set docXSL = New MSXML2.DOMDocument
docXSL.async = False
docXSL.Load strPath & "output_UTF8.xsl"
MsgBox docXML.transformNode(docXSL)
End Sub

transformNodeToObject

This does method does almost the same thing as transformNode, but it redirects the output to a object that is passed as the second parameter, in the topic of this text, I’ve to say that it’s much better, since transformNodeToObject, does respect the xsl:output element and it’s encoding attribute. So you can pass a UTF-8 file and convert it to UTF-16 and vice-versa.

Sub DOMTransformNodeToObject()
Dim docXML As MSXML2.DOMDocument
Dim docXSL As MSXML2.DOMDocument
Dim docOut As MSXML2.DOMDocument
Dim strPath As String
strPath = GetPath(CurrentDb.Name)
Set docXML = New MSXML2.DOMDocument
docXML.async = False
docXML.Load strPath & "encUTF16.xml"
Set docXSL = New MSXML2.DOMDocument
docXSL.async = False
docXSL.Load strPath & "encUTF8.xml"
Set docOut = New MSXML2.DOMDocument
Call docXML.transformNodeToObject(docXSL, docOut)
docOut.Save strPath & "test.xml"
End Sub

IXSLProcessor

transform / output

Another way to transform a xml file with a xslt stylesheet is using the IXSLProcessor. The transform method it’s the way to start the transformation, so what we are really interrested in here is how the output property works. This property can lead to different results depending in the way that you read it’s content, MSXML 3 SDK gives a brief note where they state that if you read sequentially from the output property then the output generated is always UTF-16 encoded and the xsl:output element is ignored.

Sub TranfWithIXSLProcessor()
Dim xtp As MSXML2.XSLTemplate
Dim xpr As MSXML2.IXSLProcessor
Dim ftdXSL As MSXML2.FreeThreadedDOMDocument
Dim docXML As MSXML2.DOMDocument
Dim strPath As String
strPath = GetPath(CurrentDb.Name)
Set docXML = New MSXML2.DOMDocument
docXML.async = False
docXML.Load strPath & "encUTF8.xml"
Set ftdXSL = New MSXML2.FreeThreadedDOMDocument
ftdXSL.async = False
ftdXSL.Load strPath & "Output_UTF8.xsl"
Set xtp = New MSXML2.XSLTemplate
Set xtp.stylesheet = ftdXSL
Set xpr = xtp.createProcessor()
xpr.input = docXML
xpr.transform
MsgBox xpr.output
End Sub

So the result will be UTF-16 encoded, and a encoding declaration is added too.

BUT, you can redirect your output to a object, for instance a ADOStream and that way the transformation will respect the xsl:output element and is encoding attribute.

Sub TranfWithIXSLProcessorToStream_v2()
Dim xtp As MSXML2.XSLTemplate
Dim xpr As MSXML2.IXSLProcessor
Dim ftdXSL As MSXML2.FreeThreadedDOMDocument
Dim docXML As MSXML2.DOMDocument
Dim stmOut As ADODB.stream
Dim strPath As String
strPath = GetPath(CurrentDb.Name)
Set docXML = New MSXML2.DOMDocument
docXML.async = False
docXML.Load strPath & "encUTF16.xml"
Set ftdXSL = New MSXML2.FreeThreadedDOMDocument
ftdXSL.async = False
ftdXSL.Load strPath & "Output_UTF8.xsl"
Set xtp = New MSXML2.XSLTemplate
Set xtp.stylesheet = ftdXSL
Set xpr = xtp.createProcessor()
xpr.input = docXML
Set stmOut = New ADODB.stream
stmOut.Open
stmOut.Charset = "UTF-8"
xpr.output = stmOut
xpr.transform
stmOut.Position = 0
MsgBox stmOut.ReadText
End Sub

MXXMLWriter and SAXXMLReader

Several Properties

MXXMLWriter was a nice variety of properties that let you control how the output is going to be, beware that this large flexibility, can bring you unusable files.

The first thing to understand is the output, by default this will be redirected to a string, but you can specify any object that supports the IStream interface. When using this from VB you must understand the the strings in VB are always UTF-16 encoded, so if you want your file encoded in a diferent way you will have to go with the IStream interface. So if you have the output redirected to a string the folowing properties will be ignored (byteOrderMark and encoding).

I will be using the ADO Stream, to show you this properties.

OmitXMLDeclaration - Control if the Text Declaration is outputted or not.

ByteOrderMark - Control if the BOM is outputted or not.

encoding - Defines the output encoding.

Private Sub SaxReaderWriter()
Dim rdr As New MSXML2.SAXXMLReader
Dim wrt As New MSXML2.MXXMLWriter
Dim stm As ADODB.stream
Dim strPath As String
strPath = GetPath(CurrentDb.Name)
Set rdr.contentHandler = wrt
wrt.omitXMLDeclaration = False
wrt.byteOrderMark = True
wrt.encoding = "UTF-8"
Set stm = New ADODB.stream
stm.Open
'You need to set the Stream Chartset, explain below
stm.Charset = "UTF-8"
wrt.output = stm
rdr.parseURL strPath & "encUTF16.xml"
wrt.Flush
stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
End Sub

On the SAXXMLReader you can obtain this properties by using the getProperty and you can insert them by using the putProperty.

ADO 2.6

Recordset

save (adPersistXML)

ADO uses UTF-8 when saving a file using the adPersistXML.

Sub PersistADOtoXML()
Dim cnn As ADODB.Connection
Dim rst As ADODB.Recordset
Dim strPath As String
strPath = GetPath(CurrentDb.Name)
Set cnn = New ADODB.Connection
cnn.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=" & CurrentDb.Name & ";"
Set rst = New ADODB.Recordset
rst.Open "SELECT * FROM tblTest", cnn
On Error Resume Next
Kill strPath & "test.xml"
rst.Save strPath & "test.xml", adPersistXML
rst.Close
Set rst = Nothing
End Sub

Stream

LoadFromFile / SaveToFile

LoadFromFile method loads the content of a file in the stream, you can load any encoding you want, but by default when you open a stream object it’s default encoding is UTF-16. so if you wish to load a file in another encoding you must change the charset of the stream.

Sub ReadSaveFileInUTF8()
Dim stm As ADODB.stream
Dim strPath As String
strPath = GetPath(CurrentDb.Name)
Set stm = New ADODB.stream
stm.Open
stm.Charset = "UTF-8"
stm.LoadFromFile strPath & "encUTF8.xml"
stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
End Sub

Charset

The default value for the charset of a stream is UTF-16, and you can set this property to tell the stream that it’s receiving a that type of encoded text, I’ve used this in the previous sample to indicate that I was loading a UTF-8 encoded file.

One of the other properties that the documentation states is that this property can do the translation from one encoding form to another, well in some samples that I’ve made this can cause you serious troubles, and leave you with a useless file. For instance let’s say that you want to convert a UTF-16 file to UTF-8, you could use some code like this

Sub TranlateToUTF8()
'ERROR Don't Use this code, jut to expose the error
Dim stm As ADODB.stream
Dim strPath As String
strPath = GetPath(CurrentDb.Name)
Set stm = New ADODB.stream
stm.Open
stm.LoadFromFile strPath & "encUTF16.xml"
stm.Charset = "UTF-8"
stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
End Sub

Well this will really damage your file, besides de fact that you now habe two BOM’s on for UTF-8 0xEF BB BF and the one that allready was there for UTF-16 0xFF FE, and the file is still two bytes (UTF-16 encoded).

Mixing both the recordset object and the stream object, would by KB Article http://support.microsoft.com/support/kb/articles/Q259/5/55.ASP, lead to a problem, and that’s understandable explanation on that KB article, but with the version of ADO (2.60.6526.0) that I’ve I couldn’t reach the problem described on the text.

Sub PersistStreamToXML()
Dim cnn As ADODB.Connection
Dim rst As ADODB.Recordset
Dim stm As ADODB.stream
Dim strPath As String
strPath = GetPath(CurrentDb.Name)
Set cnn = New ADODB.Connection
cnn.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
"Data Source=" & CurrentDb.Name & ";"
Set rst = New ADODB.Recordset
rst.Open "SELECT * FROM tblTest", cnn
Set stm = New ADODB.stream
rst.Save stm, adPersistXML
rst.Close
Set rst = Nothing
MsgBox stm.Charset
stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
Set stm = Nothing
End Sub

This output a perfect UTF-16 encoded file…

Final Comment

Well I could go forever with this part of the article, since there are so many objects and interfaces, that is impossible to cover them all in just one article, I've tried to talk about the more obvious ones, that at least I'm using. Also I made a deliberated choice of only describe the MSXML parser, other will behave differently, it's up to you to be advise to read their documentation, and hope no to be bitten by encoding problem. I'm aware that it was just a few words on the subject as the titles of this state.

RESOURCES:

From all of them I would like to point out Michael Kaplan Book, if you don't have it IMHO it's a very helpful resource, with the i18n subject

http://www.i18nwithvb.com/

http://www.amazon.com/exec/obidos/ASIN/0672319772/002-9388292-4866449

A good article "XML International" in XML Developer (October 2000), by Michael Kaplan.

http://www.xmldevelopernewsletter.com

I've also found the article by Chris Lovett in MSDN a good resource.

http://msdn.microsoft.com/library/en-us/dnxml/html/xmlencodings.asp

MSXML 3 SDK download it at

http://download.microsoft.com/download/xml/SDK/3.0/WIN98Me/EN-US/xmlsdk.exe

KB Article http://support.microsoft.com/support/kb/articles/Q275/8/83.ASP

Fundamental for Specifications

http://www.unicode.org/

http://www.w3.org/

Any comment and problem's please send a e-mail to pmpg98@hotmail.com

Print Version

Home

XML/XSLT Examples