Character Encoding... A few words on the subject
article code
Contents

• What is Unicode?
• XML and character encoding
• XSLT and the Output
• Other tools, how they handle this?
• Objects/Interfaces encoding
• Final Comment
• Resources


What is Unicode?

I've seen lots of confusion, and even the documentation sometimes brings a misleading belief of what in fact is Unicode.

Unicode is a worldwide character-encoding standard, published by the Unicode Consortium. Computers store numbers that represent a character, Unicode provides a unique number for every character. Unicode is compatible with the ISO/IEC 10646.

Now come the need to encode such characters and that made available several ways of doing just that, there are several ways of representing the same characters

So what the heck mean UCS-* and UTF-*

UCS is just the acronym of Universal Character Set specified by the ISO/IEC 10646. The number that accompanies this acronym indicates the number of octets that the character set is coded. UCS-2 (two octets) and UCS-4 (four octets)

UTF is the acronym for Unicode Transformation Format

In order to be compatible with older systems that didn't support Unicode, the Unicode Consortium, defined what they call Encoding Forms, that is the representation of the character in bits, the number that accompanies this acronym indicates the encoding form that is to be used. So UTF-8 (8-bit encoding form), UTF-16 (16-bit encoding form). Be aware that this doesn't mean that in the case of the UTF-8 it will always use only 8 bits.

UTF use bit-shifting techniques to encode Unicode characters, UTF-8 encodes each Unicode character as variable number of bytes (usually 3), a particularity of UTF-8 is that character below the 127 range in ASCII have the same byte values in ASCII and UTF-8. UTF-16 always encodes each Unicode character as two bytes.

Little-Endian and Big-Endian

Some microprocessors differ in the use of the least significant byte, So you have a different BOM (see below) to indicate this byte order. Sometimes you will see the UTF-16LE and UTF-16BE denomination to indicate this byte order.

What is a BOM?

BOM (Byte Order Mark), this is a encoding signature for the file. A particular sequence of bytes at the beginning of the file that indicates the encoding and the byte order.

UTF-8 - EF BB BF - 

UTF-16LE - FF FE - ÿþ

UTF-16BE - FE FF - þÿ

This encoding signature is not to be displayed, any tool that support Unicode will understand this and will not show this to you nor consider it to be part of the text file. So you also have a problem that is how can I find out what BOM is being used. Well checking the Hexadecimal of the file or opening the file in a non-unicode text editor will give you those characters presented in the above table.

How are character represented

It’s important to understand how characters are encoded, all my samples use the character í. These characters along with other accentuated characters are very common on my natural language. I’ve opted for this one because it’s easy for me to write it, and it’s above the 127 range in ASCII, making him two bytes in UTF-8 (this will be important to show up some problems).

I've made this simple table to try and explain all this, hopefully you will understand it a little bit better…

Chr

dec

hex

UCS

dec

UCS hex

UTF8

dec

UTF8

hex

one by one

Unicode Name

í

146

92

237

00ED

195.173

C3AD

Ã-

Latin Small Letter I with Acute

For more information take a look at the following links.
http://www.unicode.org/unicode/standard/WhatIsUnicode.html
http://www.unicode.org/unicode/standard/principles.html

Back to Top

XML and character encoding

The W3C specify that all XML processor must read the UTF-8 and UTF-16 encoding. This text explain that to differentiate between UTF-8 and UTF-16 a BOM must be present, and that the BOM must be used by the parser as encoding signature. Other encoding may be supported, but no parser is required to have support for all of them, or one in particular, besides the UTF-8 and UTF-16. Well if this text is not enough to convince you that you should be using Unicode for all your XML files I don't known what else can I say. Well you should really give a good enough though every time you think using anything else besides Unicode (UTF-8 or UTF-16).

With IE and MSXML, there are two really common errors that happen when something is not correct with the three steps defined below.

An invalid character was found in text content (see my sample file error1.xml)

A parser found a character on your file that is not according the encoding declaration or the BOM specified for that file.

In the sample I've the character encoded with ISO-8859-1 and then speficy the UTF-8 on the encoding declaration, this will issue the error.

The reason that it doesn't issue the error below is that since the character is encoded with a one byte long encoding (ISO-8859-1), the parser at the beginning find this a valid character, finding later that can't understand the character that was used.

Switch from current encoding to specified encoding not supported

In a basic thinking this error is almost identical to the previous, the only thing is that the parser understands that the real encoding on the file is diferent from the one in the encoding declaration.

What the error is trying to tell you is that it can’t make the switch from the file encoding to the one you specify on the encoding declaration.

In error2.xml I've a file encoded as UTF-8 and a text encoding of UTF-16, since UTF-16 must always be two bytes, the parser known forehand that something is wrong with the encoding, the error3.xml is the same problem the other way around.

Declarations

XML and Text Declaration


[23] XMLDecl ::= ’<?xml’ VersionInfo [p.12] EncodingDecl
                 [p.38] ? SDDecl [p.15] ? S [p.8] ? ’?>’
[77] TextDecl ::= ’<?xml’ VersionInfo [p.12] ? EncodingDecl
                  [p.38] S [p.8] ? ’?>’

sample: <?xml version="1.0" encoding ='UTF-8'?>

The encoding declaration is part of the text declaration, in this form:


[80] EncodingDecl ::= S [p.8] ’encoding’ Eq
                      [p.12] (’"’ EncName [p.38]
                      ’"’ | "’" EncName [p.38]
                      "’" )

sample: encoding ='UTF-8'

Auto-Detecting the encoding

The W3C defined a non-normative (Appendix F).

There are 3 steps rules to check this.

  1. If there is a BOM, then this defines the file encoding.
  2. If there isn't then look at the Encoding Declaration.
  3. If none of the above meet then assume UTF-8.

Since all XML files must start with a text declaration and that start with <? You can look at those characters to try and find out which encoding is used.

So what about character entities?

Character entity permits you to write a character without actually having to write the character. The drawback is that you don't want to actually write a character entity for each of those accentuated characters, this maybe nice feature for a once-in-time character like the trademark sign, but not for writing actual language.

You can specify your entities in three ways:

  1. By name &iacute;
  2. By decimal number &#237;
  3. By hexadecimal number &#x00ed;

See the sample file entities.xml

For more information take a look at the following links.
http://www.w3.org/TR/REC-xml#charencoding
http://www.w3.org/TR/2000/REC-xml-20001006#sec-guessing

Back to Top

XSLT and the Output

Since XSLT is used to transform XML into other files formats (including XML), it's important to be able to specify the encoding that the developer of the XSLT wants on the destination.

The W3Cdefines a <xsl:output> element for this propose, and a encoding attribute to specify the pretended encoding of the output.


<!-- Category: top-level-element -->
     <xsl:output
          method = "xml" | "html" | "text" | qname-but-not-ncname 
          version = nmtoken
          encoding = string
          omit-xml-declaration = "yes" | "no"
          standalone = "yes" | "no"
          doctype-public = string
          doctype-system = string
          cdata-section-elements = qnames
          indent = "yes" | "no"
          media-type = string
     />

A XSLT Processor may perform the serialization of the result tree, but it's not required to. In the situation where the XSLT Processor does make the serialization of the result tree, it should, but it's not required to, to respect the definition in the <xsl:output> element.

For more information take a look at the following links.
http://www.w3.org/TR/1999/REC-xslt-19991116#output

Back to Top

Other tools, how they handle this?

The Unicode Consortium defines a set of rules for a Unicode enabled application, every application that claim to be unicode compliant should observe this set of rules.

The problem is that exist a large number of tools, applications and operating systems that still aren't Unicode compliant, so whenever you are trying to handle unicode files, certify yourself that you are using a unicode compliant application.

One of my first problems concerning Unicode and XML was because I was using a text editor that didn't support Unicode. One of these editors is the windows 95/98 notepad, this tool doesn't support Unicode so it will always write the file as ASCII based on the codepage that you have on selected on your Windows environment. The notepad that ships with Windows 2000 already has support for Unicode, you can define the encoding choosing one of the options in the encoding combo box, options are (ANSI, Unicode, Unicode Big-Endian and UTF-8).

Beware of the interpretation of MS of the word Unicode, almost every time that MS refers to Unicode they mean UTF-16.

This is referring to the part of saving files, when you are opening them this too can bring problems. Windows 2000 Notepad will read the BOM, and if present will respect it, but if no BOM is present, then we will scan the file trying to guest is encoding. If it decide this is an UTF-8 encoded, when you save it will put a BOM in the file.

Even if the tool support unicode, it can have different implementation, on the grey areas, for instance the editor I'm using, if no BOM is present and you open a UTF-8 encoded file, it will not scan the file to see if it's UTF-8 or not, like notepad 2000 does. So if during the edition you write a character like the "í" it will encode that according Windows codepage in a single byte encoding form.

So it's important that you check out your editor specification how it handles unicode, you can save yourself a lot of problem if you understand how it does it.

Objects/Interfaces encoding

Another frequent problem with encoding, are the objects/interfaces, and the way they handle character encoding. Well we could extend this article to more objects/interfaces, but since that would take forever. I recommend you to read careful the documentation of those and check out how they handle this situation, and even so, you will find that sometimes documentation is wrong, but that is maybe just a small part of it, even so you really must try it yourself.

MSXML 3sp1

DOMDocument

  1. Load/ LoadXml
  2. Those two methods serve to load a xml into the DOMDocument.

    Carefull that the MSXML 3 SDK documentation, mention there that the LoadXml, only handles a UCS-2 or UTF-16 encoded string, but if you look at the KB Article Q275883 http://support.microsoft.com/support/kb/articles/Q275/8/83.ASP you will noticed that is says that this restriction is removes starting on MSXML 3.0.

    Sub DOMLoadXml()
    Dim docOut As MSXML2.DOMDocument
    Dim stm As ADODB.stream
    Dim strPath As String
    Dim bolRet As Boolean
    strPath = GetPath(CurrentDb.Name)
    Set stm = New ADODB.stream
    stm.Open
    stm.Charset = "UTF-8"
    stm.LoadFromFile strPath & "encUTF8.xml"
    'Juts to check it out
    MsgBox stm.Charset
    stm.Position = 0
    Set docOut = New MSXML2.DOMDocument
    bolRet = docOut.loadXML(stm.ReadText)
    If Not bolRet Then
    MsgBox "The xml string was not loaded"
    End If
    'Just making sure
    On Error Resume Next
    Kill strPath & "test.xml"
    docOut.Save strPath & "test.xml"
    End Sub
    

    If you check the test.xml file you will see that the character "í" is encoded as 0xC3AD, meaning that it's been read from UTF-8 and saved in UTF-8 too.

  3. Save
  4. The save method bases the decision of the character encoding in the encoding declaration. I've seen that in the previous snippet the file was saved in UTF-8, let see if it does is job with other encoding. If no encoding declaration is found then it will be encoded as UTF-8.

    Private Sub DOMSave()
    Dim docOut As MSXML2.DOMDocument
    Dim domPI As MSXML2.IXMLDOMProcessingInstruction
    Dim eleRoot As MSXML2.IXMLDOMElement
    Dim eleChild As MSXML2.IXMLDOMElement
    Dim strPath As String
    Set docOut = New MSXML2.DOMDocument
    Set domPI = docOut.createProcessingInstruction("xml", "version='1.0' encoding='UTF-16'")
    docOut.appendChild domPI
    Set eleRoot = docOut.createElement("Character")
    docOut.appendChild eleRoot
    Set eleChild = eleRoot.appendChild(docOut.createElement("chr"))
    eleChild.setAttribute "ShipAddress", "í"
    strPath = GetPath(CurrentDb.Name)
    On Error Resume Next
    Kill strPath & "test.xml"
    docOut.Save strPath & "test.xml"
    End Sub
    

    If you check the test.xml file you will see that the file as a BOM (0xFFFE) and that the character "í" is encoded as 0xED00, meaning that it's been saved as UTF-16.

  5. xml
  6. The xml read-only property retrieves the node and all is child’s.

    The part that concerns this article, this property will remove the encoding declaration and any BOM, of any file that is not UTF-16 encoded, this is handy to pass to the LoadXml.

    Private Sub DOMxml()
    Dim docIn As MSXML2.DOMDocument
    Dim strPath As String
    Dim lpBuffer() As Byte
    strPath = GetPath(CurrentDb.Name)
    Set docIn = New MSXML2.DOMDocument
    docIn.async = False
    docIn.Load strPath & "encUTF8.xml"
    MsgBox docIn.xml
    lpBuffer = docIn.xml
    Debug.Print "0x" & Hex(lpBuffer(0)) & Hex(lpBuffer(1)) & Hex(lpBuffer(2)) & Hex(lpBuffer(3))
    End Sub 
    

    Looking at the first 4 bytes of this string show us that indeed it’s encoding was changed to UTF-16.

  7. transformNode
  8. The transformNode Method takes a node and is children’s and transforms them according to the supplied XSLT stylesheet given.

    This will always return a string encoded as UTF-16 and will add the Encoding declaration with UTF-16 too, that way we doesn’t observe what was specified in the xsl:output element of the XSLT stylesheet.

    Sub DOMTransformNode()
    Dim docXML As MSXML2.DOMDocument
    Dim docXSL As MSXML2.DOMDocument
    Dim strPath As String
    strPath = GetPath(CurrentDb.Name)
    Set docXML = New MSXML2.DOMDocument
    docXML.async = False
    docXML.Load strPath & "encUTF8.xml"
    Set docXSL = New MSXML2.DOMDocument
    docXSL.async = False
    docXSL.Load strPath & "output_UTF8.xsl"
    MsgBox docXML.transformNode(docXSL)
    End Sub
    
  9. transformNodeToObject
  10. This does method does almost the same thing as transformNode, but it redirects the output to a object that is passed as the second parameter, in the topic of this text, I’ve to say that it’s much better, since transformNodeToObject, does respect the xsl:output element and it’s encoding attribute. So you can pass a UTF-8 file and convert it to UTF-16 and vice-versa.

    Sub DOMTransformNodeToObject()
    Dim docXML As MSXML2.DOMDocument
    Dim docXSL As MSXML2.DOMDocument
    Dim docOut As MSXML2.DOMDocument
    Dim strPath As String
    strPath = GetPath(CurrentDb.Name)
    Set docXML = New MSXML2.DOMDocument
    docXML.async = False
    docXML.Load strPath & "encUTF16.xml"
    Set docXSL = New MSXML2.DOMDocument
    docXSL.async = False
    docXSL.Load strPath & "encUTF8.xml"
    Set docOut = New MSXML2.DOMDocument
    Call docXML.transformNodeToObject(docXSL, docOut)
    docOut.Save strPath & "test.xml"
    End Sub
    

IXSLProcessor

  1. transform / output
  2. Another way to transform a xml file with a xslt stylesheet is using the IXSLProcessor. The transform method it’s the way to start the transformation, so what we are really interrested in here is how the output property works. This property can lead to different results depending in the way that you read it’s content, MSXML 3 SDK gives a brief note where they state that if you read sequentially from the output property then the output generated is always UTF-16 encoded and the xsl:output element is ignored.

    Sub TranfWithIXSLProcessor()
    Dim xtp As MSXML2.XSLTemplate
    Dim xpr As MSXML2.IXSLProcessor
    Dim ftdXSL As MSXML2.FreeThreadedDOMDocument
    Dim docXML As MSXML2.DOMDocument
    Dim strPath As String
    strPath = GetPath(CurrentDb.Name)
    Set docXML = New MSXML2.DOMDocument
    docXML.async = False
    docXML.Load strPath & "encUTF8.xml"
    Set ftdXSL = New MSXML2.FreeThreadedDOMDocument
    ftdXSL.async = False
    ftdXSL.Load strPath & "Output_UTF8.xsl"
    Set xtp = New MSXML2.XSLTemplate
    Set xtp.stylesheet = ftdXSL
    Set xpr = xtp.createProcessor()
    xpr.input = docXML
    xpr.transform
    MsgBox xpr.output
    End Sub
     
    

    So the result will be UTF-16 encoded, and a encoding declaration is added too.

    BUT, you can redirect your output to a object, for instance a ADOStream and that way the transformation will respect the xsl:output element and is encoding attribute.

    Sub TranfWithIXSLProcessorToStream_v2()
    Dim xtp As MSXML2.XSLTemplate
    Dim xpr As MSXML2.IXSLProcessor
    Dim ftdXSL As MSXML2.FreeThreadedDOMDocument
    Dim docXML As MSXML2.DOMDocument
    Dim stmOut As ADODB.stream
    Dim strPath As String
    strPath = GetPath(CurrentDb.Name)
    Set docXML = New MSXML2.DOMDocument
    docXML.async = False
    docXML.Load strPath & "encUTF16.xml"
    Set ftdXSL = New MSXML2.FreeThreadedDOMDocument
    ftdXSL.async = False
    ftdXSL.Load strPath & "Output_UTF8.xsl"
    Set xtp = New MSXML2.XSLTemplate
    Set xtp.stylesheet = ftdXSL
    Set xpr = xtp.createProcessor()
    xpr.input = docXML
    Set stmOut = New ADODB.stream
    stmOut.Open
    stmOut.Charset = "UTF-8"
    xpr.output = stmOut
    xpr.transform
    stmOut.Position = 0
    MsgBox stmOut.ReadText
    End Sub
    

MXXMLWriter and SAXXMLReader

  1. Several Properties
  2. MXXMLWriter was a nice variety of properties that let you control how the output is going to be, beware that this large flexibility, can bring you unusable files.

    The first thing to understand is the output, by default this will be redirected to a string, but you can specify any object that supports the IStream interface. When using this from VB you must understand the the strings in VB are always UTF-16 encoded, so if you want your file encoded in a diferent way you will have to go with the IStream interface. So if you have the output redirected to a string the folowing properties will be ignored (byteOrderMark and encoding).

    I will be using the ADO Stream, to show you this properties.

    OmitXMLDeclaration - Control if the Text Declaration is outputted or not.

    ByteOrderMark - Control if the BOM is outputted or not.

    encoding - Defines the output encoding.

    Private Sub SaxReaderWriter()
    Dim rdr As New MSXML2.SAXXMLReader
    Dim wrt As New MSXML2.MXXMLWriter
    Dim stm As ADODB.stream
    Dim strPath As String
    strPath = GetPath(CurrentDb.Name)
    Set rdr.contentHandler = wrt
    wrt.omitXMLDeclaration = False
    wrt.byteOrderMark = True
    wrt.encoding = "UTF-8"
    Set stm = New ADODB.stream
    stm.Open
    'You need to set the Stream Chartset, explain below
    stm.Charset = "UTF-8"
    wrt.output = stm
    rdr.parseURL strPath & "encUTF16.xml"
    wrt.Flush
    stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
    End Sub
    

    On the SAXXMLReader you can obtain this properties by using the getProperty and you can insert them by using the putProperty.

ADO 2.6

Recordset

  1. save (adPersistXML)
  2. ADO uses UTF-8 when saving a file using the adPersistXML.

    Sub PersistADOtoXML()
    Dim cnn As ADODB.Connection
    Dim rst As ADODB.Recordset
    Dim strPath As String
    strPath = GetPath(CurrentDb.Name)
    Set cnn = New ADODB.Connection
    cnn.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
    "Data Source=" & CurrentDb.Name & ";"
    Set rst = New ADODB.Recordset
    rst.Open "SELECT * FROM tblTest", cnn
    On Error Resume Next
    Kill strPath & "test.xml"
    rst.Save strPath & "test.xml", adPersistXML
    rst.Close
    Set rst = Nothing
    End Sub
    

Stream

  1. LoadFromFile / SaveToFile
  2. LoadFromFile method loads the content of a file in the stream, you can load any encoding you want, but by default when you open a stream object it’s default encoding is UTF-16. so if you wish to load a file in another encoding you must change the charset of the stream.

    Sub ReadSaveFileInUTF8()
    Dim stm As ADODB.stream
    Dim strPath As String
    strPath = GetPath(CurrentDb.Name)
    Set stm = New ADODB.stream
    stm.Open
    stm.Charset = "UTF-8"
    stm.LoadFromFile strPath & "encUTF8.xml"
    stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
    End Sub
    
  3. Charset
  4. The default value for the charset of a stream is UTF-16, and you can set this property to tell the stream that it’s receiving a that type of encoded text, I’ve used this in the previous sample to indicate that I was loading a UTF-8 encoded file.

    One of the other properties that the documentation states is that this property can do the translation from one encoding form to another, well in some samples that I’ve made this can cause you serious troubles, and leave you with a useless file. For instance let’s say that you want to convert a UTF-16 file to UTF-8, you could use some code like this

    Sub TranlateToUTF8()
    'ERROR Don't Use this code, jut to expose the error
    Dim stm As ADODB.stream
    Dim strPath As String
    strPath = GetPath(CurrentDb.Name)
    Set stm = New ADODB.stream
    stm.Open
    stm.LoadFromFile strPath & "encUTF16.xml"
    stm.Charset = "UTF-8"
    stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
    End Sub
    

    Well this will really damage your file, besides de fact that you now habe two BOM’s on for UTF-8 0xEF BB BF and the one that allready was there for UTF-16 0xFF FE, and the file is still two bytes (UTF-16 encoded).

    Mixing both the recordset object and the stream object, would by KB Article http://support.microsoft.com/support/kb/articles/Q259/5/55.ASP, lead to a problem, and that’s understandable explanation on that KB article, but with the version of ADO (2.60.6526.0) that I’ve I couldn’t reach the problem described on the text.

    Sub PersistStreamToXML()
    Dim cnn As ADODB.Connection
    Dim rst As ADODB.Recordset
    Dim stm As ADODB.stream
    Dim strPath As String
    strPath = GetPath(CurrentDb.Name)
    Set cnn = New ADODB.Connection
    cnn.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
    "Data Source=" & CurrentDb.Name & ";"
    Set rst = New ADODB.Recordset
    rst.Open "SELECT * FROM tblTest", cnn
    Set stm = New ADODB.stream
    rst.Save stm, adPersistXML
    rst.Close
    Set rst = Nothing
    MsgBox stm.Charset
    stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
    Set stm = Nothing
    End Sub
    

    This output a perfect UTF-16 encoded file…

Back to Top

Final Comment

Well I could go forever with this part of the article, since there are so many objects and interfaces, that is impossible to cover them all in just one article, I've tried to talk about the more obvious ones, that at least I'm using. Also I made a deliberated choice of only describe the MSXML parser, other will behave differently, it's up to you to be advise to read their documentation, and hope no to be bitten by encoding problem. I'm aware that it was just a few words on the subject as the titles of this state.

Back to Top

RESOURCES:

From all of them I would like to point out Michael Kaplan Book, if you don't have it IMHO it's a very helpful resource, with the i18n subject

http://www.i18nwithvb.com/

http://www.amazon.com/exec/obidos/ASIN/0672319772/002-9388292-4866449

A good article "XML International" in XML Developer (October 2000), by Michael Kaplan.

http://www.xmldevelopernewsletter.com

I've also found the article by Chris Lovett in MSDN a good resource.

http://msdn.microsoft.com/library/en-us/dnxml/html/xmlencodings.asp

MSXML 3 SDK download it at

http://download.microsoft.com/download/xml/SDK/3.0/WIN98Me/EN-US/xmlsdk.exe

KB Article http://support.microsoft.com/support/kb/articles/Q275/8/83.ASP

Fundamental for Specifications

http://www.unicode.org/

http://www.w3.org/

Back to Top

Any comment and problem's please send a e-mail to pmpg98@hotmail.com
Print Version
Home XML/XSLT Examples