Character Encoding... A few words on the subject
What is Unicode?

I've seen a lot of confusion on this subject, and even the documentation sometimes gives a misleading idea of what Unicode actually is. Unicode is a worldwide character-encoding standard published by the Unicode Consortium. Computers store numbers that represent characters, and Unicode provides a unique number for every character. Unicode is kept compatible with ISO/IEC 10646. Those numbers then have to be encoded as bytes, and there are several ways of doing just that, so the same character can be represented in several ways.

So what the heck do UCS-* and UTF-* mean?

UCS is the acronym of Universal Character Set, specified by ISO/IEC 10646. The number that accompanies the acronym indicates the number of octets used to code each character: UCS-2 (two octets) and UCS-4 (four octets).

UTF is the acronym of Unicode Transformation Format. In order to be compatible with older systems that didn't support Unicode, the Unicode Consortium defined what it calls encoding forms, that is, the representation of a character in bits. The number that accompanies the acronym indicates the encoding form to be used: UTF-8 (8-bit encoding form) and UTF-16 (16-bit encoding form). Be aware that this doesn't mean that UTF-8 always uses only 8 bits per character. The UTFs use bit-shifting techniques to encode Unicode characters. UTF-8 encodes each Unicode character as a variable number of bytes (from one to four). A particularity of UTF-8 is that the ASCII characters (0 to 127) have the same byte values in ASCII and in UTF-8. UTF-16 encodes most characters as two bytes; characters outside the Basic Multilingual Plane take four bytes (a surrogate pair).

Little-Endian and Big-Endian

Microprocessors differ in the order in which they store the bytes of a multi-byte value: least significant byte first (little-endian) or most significant byte first (big-endian). So there is a different BOM (see below) to indicate this byte order. Sometimes you will see the names UTF-16LE and UTF-16BE used to indicate the byte order.

What is a BOM?

The BOM (Byte Order Mark) is an encoding signature for the file.
It is a particular sequence of bytes at the beginning of the file that indicates the encoding and the byte order.

| Encoding | BOM bytes | Shown in a non-Unicode editor |
| UTF-8 | EF BB BF | ï»¿ |
| UTF-16LE | FF FE | ÿþ |
| UTF-16BE | FE FF | þÿ |

This encoding signature is not meant to be displayed: any tool that supports Unicode will understand it and will neither show it to you nor consider it part of the text. That leaves the question of how to find out which BOM is being used. Looking at a hexadecimal dump of the file, or opening the file in a non-Unicode text editor, will show you the bytes or characters presented in the table above.

How are characters represented?

It's important to understand how characters are encoded. All my samples use the character í. This character, along with other accented characters, is very common in my native language. I chose it because it's easy for me to type and because it is above the ASCII range (0 to 127), which makes it two bytes long in UTF-8 (this will be important to expose some problems). I've made this simple table to try to explain all this; hopefully it will help you understand it a little better…
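As a rough illustration of the byte values discussed above, here is a minimal Python sketch (Python is used only because it runs standalone; the article's own samples are VBA). The helper names `hex_bytes` and `encoded_forms` are made up for this example and are not part of the article's sample files.

```python
import codecs


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encoded_forms(ch: str) -> dict:
    """Return the bytes stored for one character under several encodings.

    Hypothetical helper for illustration only. The 'with BOM' rows
    prepend the signature a BOM-writing editor would put at the start
    of the file.
    """
    return {
        "ISO-8859-1": hex_bytes(ch.encode("iso-8859-1")),
        "UTF-8": hex_bytes(ch.encode("utf-8")),
        "UTF-8 with BOM": hex_bytes(codecs.BOM_UTF8 + ch.encode("utf-8")),
        "UTF-16LE with BOM": hex_bytes(codecs.BOM_UTF16_LE + ch.encode("utf-16-le")),
        "UTF-16BE with BOM": hex_bytes(codecs.BOM_UTF16_BE + ch.encode("utf-16-be")),
    }


# U+00ED is the character í used throughout the article.
for name, value in encoded_forms("\u00ed").items():
    print(f"{name:18} {value}")
```

Running it for í shows one byte in ISO-8859-1 (ED), two bytes in UTF-8 (C3 AD), and the two possible byte orders in UTF-16 (ED 00 and 00 ED), each preceded by its BOM where requested.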
For more information take a look at the following links.
XML and character encoding

The W3C specifies that every XML processor must be able to read the UTF-8 and UTF-16 encodings. The specification explains that a UTF-16 document must begin with a BOM, and that the parser must use the BOM as an encoding signature to differentiate between UTF-8 and UTF-16. Other encodings may be supported, but no parser is required to support all of them, or any one in particular, besides UTF-8 and UTF-16. If this is not enough to convince you that you should be using Unicode for all your XML files, I don't know what else I can say. You should think twice every time you consider using anything other than Unicode (UTF-8 or UTF-16).

With IE and MSXML, there are two really common errors that happen when something is not correct with the three steps defined below.

An invalid character was found in text content (see my sample file error1.xml)

The parser found a character in your file that doesn't match the encoding declaration or the BOM of that file. In the sample, the character is encoded in ISO-8859-1 while the encoding declaration specifies UTF-8, which raises this error. The reason it doesn't raise the error below is that, since the character is encoded with a single-byte encoding (ISO-8859-1), the parser initially accepts the bytes as valid and only later finds a character that it can't understand.

Switch from current encoding to specified encoding not supported

Basically this error is almost identical to the previous one; the difference is that the parser realizes that the real encoding of the file is different from the one in the encoding declaration. What the error is telling you is that it can't switch from the file's encoding to the one you specified in the encoding declaration.
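The first error can be reproduced outside MSXML. The sketch below uses Python's standard `xml.etree.ElementTree` parser (built on expat), so the error text differs from the MSXML message quoted above; the two byte strings are hypothetical stand-ins written for this example and are not the article's error1.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the error1.xml scenario: the declaration says
# UTF-8, but the í was written as the single ISO-8859-1 byte 0xED.
MISLABELLED = b'<?xml version="1.0" encoding="UTF-8"?><chr>\xed</chr>'

# Same bytes, but the declaration now matches the real encoding.
CORRECT = b'<?xml version="1.0" encoding="ISO-8859-1"?><chr>\xed</chr>'


def try_parse(data: bytes) -> str:
    """Return the element text, or a short message if parsing fails."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError as exc:
        return f"parse error: {exc}"


print(try_parse(MISLABELLED))  # the parser rejects the stray 0xED byte
print(try_parse(CORRECT))      # parses, and the text is the character í
```

The point is the same as in the MSXML case: the bytes are identical in both documents, and only the agreement between the declaration and the real encoding decides whether the file can be read.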
In error2.xml I have a file encoded as UTF-8 with an encoding declaration of UTF-16. Since UTF-16 always uses two-byte units, the parser knows beforehand that something is wrong with the encoding. error3.xml is the same problem the other way around.

Declarations

XML and Text Declaration

    [23] XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
    [77] TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'

    sample: <?xml version="1.0" encoding ='UTF-8'?>

The encoding declaration is part of the text declaration, in this form:

    [80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")

    sample: encoding ='UTF-8'

Auto-Detecting the encoding

The W3C describes a non-normative way of detecting the encoding (Appendix F of the XML specification). There are three rules to check.
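A simplified version of that detection can be sketched as follows. This is only an approximation of the idea in Appendix F, written in Python for illustration: it checks the BOMs from the table earlier in this article and then the byte patterns of the leading `<?` characters. The function name `sniff_xml_encoding` and its return labels are invented for this example, and UCS-4 and EBCDIC, which the appendix also covers, are deliberately ignored.

```python
import codecs


def sniff_xml_encoding(head: bytes) -> str:
    """Guess the encoding family from the first bytes of an XML file.

    Simplified sketch: only the UTF-8 and UTF-16 cases are handled.
    """
    # Step 1: an explicit BOM settles the question.
    if head.startswith(codecs.BOM_UTF8):
        return "UTF-8 (BOM)"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE (BOM)"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE (BOM)"
    # Step 2: no BOM, so look at how the leading '<?' is laid out.
    if head.startswith(b"<\x00?\x00"):
        return "UTF-16LE (no BOM)"
    if head.startswith(b"\x00<\x00?"):
        return "UTF-16BE (no BOM)"
    # Step 3: single-byte '<?xm' means an ASCII-compatible encoding;
    # the encoding declaration has to be read to know which one.
    if head.startswith(b"<?xm"):
        return "ASCII-compatible (read the encoding declaration)"
    return "unknown (assume UTF-8)"


print(sniff_xml_encoding(b"\xef\xbb\xbf<?xml version='1.0'?>"))
print(sniff_xml_encoding("<?xml version='1.0'?>".encode("utf-16-le")))
```

In practice you would pass the first four or so bytes read from the file in binary mode; a real parser then goes on to read the encoding declaration using the family it has guessed.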
Since an XML file that carries a text declaration must start with the characters <?, you can look at how those characters are stored to try to find out which encoding is used.

So what about character entities?

A character entity lets you write a character without actually having to type the character itself. The drawback is that you don't want to write a character entity for each accented character: this may be a nice feature for a once-in-a-while character like the trademark sign, but not for writing an actual language. You can specify your entities in three ways:
See the sample file entities.xml

For more information take a look at the following links.

XSLT and the Output

Since XSLT is used to transform XML into other file formats (including XML), it's important to be able to specify the encoding that the developer of the XSLT wants for the destination. The W3C defines an <xsl:output> element for this purpose, with an encoding attribute to specify the intended encoding of the output.

    <!-- Category: top-level-element -->
    <xsl:output
      method = "xml" | "html" | "text" | qname-but-not-ncname
      version = nmtoken
      encoding = string
      omit-xml-declaration = "yes" | "no"
      standalone = "yes" | "no"
      doctype-public = string
      doctype-system = string
      cdata-section-elements = qnames
      indent = "yes" | "no"
      media-type = string />

An XSLT processor may serialize the result tree, but it's not required to. When the XSLT processor does serialize the result tree, it should respect what is defined in the <xsl:output> element, but it's not required to do that either.

For more information take a look at the following links.

Other tools, how do they handle this?

The Unicode Consortium defines a set of rules for Unicode-enabled applications, and every application that claims to be Unicode compliant should observe them. The problem is that there are a large number of tools, applications and operating systems that still aren't Unicode compliant, so whenever you handle Unicode files, make sure that you are using a Unicode-compliant application. One of my first problems with Unicode and XML came from using a text editor that didn't support Unicode. One of these editors is the Windows 95/98 Notepad: this tool doesn't support Unicode, so it will always write the file in a single-byte encoding based on the codepage selected in your Windows environment.
The Notepad that ships with Windows 2000 already supports Unicode. You can define the encoding by choosing one of the options in the encoding combo box (ANSI, Unicode, Unicode Big-Endian and UTF-8). Beware of Microsoft's interpretation of the word Unicode: almost every time Microsoft refers to Unicode they mean UTF-16. That covers saving files; opening them can bring problems too. Windows 2000 Notepad will read the BOM and, if one is present, respect it, but if no BOM is present it will scan the file and try to guess its encoding. If it decides the file is UTF-8 encoded, it will put a BOM in the file when you save it.

Even when a tool supports Unicode, implementations can differ in the grey areas. For instance, if no BOM is present and you open a UTF-8 encoded file in the editor I'm using, it will not scan the file to see whether it's UTF-8 or not, as Windows 2000 Notepad does. So if you type a character like "í" while editing, it will be encoded according to the Windows codepage, in a single-byte encoding form. It's therefore important to check in your editor's specification how it handles Unicode; you can save yourself a lot of problems if you understand how it does it.

Objects/Interfaces encoding

Another frequent source of encoding problems is the objects and interfaces you use, and the way they handle character encoding. We could extend this article to more objects and interfaces, but that would take forever. I recommend that you carefully read their documentation and check how they handle this situation. Even so, you will find that the documentation is sometimes wrong, perhaps only in a small part, so you really must try it yourself.

MSXML 3 SP1

DOMDocument
These two methods serve to load XML into the DOMDocument. Be careful: the MSXML 3 SDK documentation mentions that loadXML only handles UCS-2 or UTF-16 encoded strings, but if you look at KB article Q275883 (http://support.microsoft.com/support/kb/articles/Q275/8/83.ASP) you will notice that it says this restriction was removed starting with MSXML 3.0.

    Sub DOMLoadXml()
        Dim docOut As MSXML2.DOMDocument
        Dim stm As ADODB.Stream
        Dim strPath As String
        Dim bolRet As Boolean

        strPath = GetPath(CurrentDb.Name)

        ' Read the file as UTF-8 text through an ADO stream
        Set stm = New ADODB.Stream
        stm.Open
        stm.Charset = "UTF-8"
        stm.LoadFromFile strPath & "encUTF8.xml"
        ' Just to check it out
        MsgBox stm.Charset
        stm.Position = 0

        ' Hand the decoded string to loadXML
        Set docOut = New MSXML2.DOMDocument
        bolRet = docOut.loadXML(stm.ReadText)
        If Not bolRet Then
            MsgBox "The xml string was not loaded"
        End If

        ' Just making sure the old output file is gone
        On Error Resume Next
        Kill strPath & "test.xml"
        docOut.Save strPath & "test.xml"
    End Sub

If you check the test.xml file you will see that the character "í" is encoded as 0xC3AD, meaning that it was read as UTF-8 and saved as UTF-8 too.

The save method bases its choice of character encoding on the encoding declaration. We've seen in the previous snippet that the file was saved in UTF-8; let's see if it does its job with another encoding. If no encoding declaration is found, the file will be encoded as UTF-8.
    Private Sub DOMSave()
        Dim docOut As MSXML2.DOMDocument
        Dim domPI As MSXML2.IXMLDOMProcessingInstruction
        Dim eleRoot As MSXML2.IXMLDOMElement
        Dim eleChild As MSXML2.IXMLDOMElement
        Dim strPath As String

        Set docOut = New MSXML2.DOMDocument

        ' The encoding declaration decides how Save encodes the file
        Set domPI = docOut.createProcessingInstruction("xml", "version='1.0' encoding='UTF-16'")
        docOut.appendChild domPI

        Set eleRoot = docOut.createElement("Character")
        docOut.appendChild eleRoot
        Set eleChild = eleRoot.appendChild(docOut.createElement("chr"))
        eleChild.setAttribute "ShipAddress", "í"

        strPath = GetPath(CurrentDb.Name)
        On Error Resume Next
        Kill strPath & "test.xml"
        docOut.Save strPath & "test.xml"
    End Sub

If you check the test.xml file you will see that the file has a BOM (0xFFFE) and that the character "í" is encoded as 0xED00, meaning that it was saved as UTF-16.

The read-only xml property retrieves the node and all its children. For what concerns this article, this property removes the encoding declaration and any BOM of any file that is not UTF-16 encoded, which is handy when you want to pass the result to loadXML.

    Private Sub DOMxml()
        Dim docIn As MSXML2.DOMDocument
        Dim strPath As String
        Dim lpBuffer() As Byte

        strPath = GetPath(CurrentDb.Name)
        Set docIn = New MSXML2.DOMDocument
        docIn.async = False
        docIn.Load strPath & "encUTF8.xml"
        MsgBox docIn.xml

        ' Copy the string into a byte array to inspect its first bytes
        lpBuffer = docIn.xml
        Debug.Print "0x" & Hex(lpBuffer(0)) & Hex(lpBuffer(1)) & Hex(lpBuffer(2)) & Hex(lpBuffer(3))
    End Sub

Looking at the first 4 bytes of this string shows us that its encoding was indeed changed to UTF-16.

The transformNode method takes a node and its children and transforms them according to the supplied XSLT stylesheet. It always returns a string encoded as UTF-16 and adds an encoding declaration with UTF-16 too, so it does not honor what was specified in the xsl:output element of the XSLT stylesheet.
    Sub DOMTransformNode()
        Dim docXML As MSXML2.DOMDocument
        Dim docXSL As MSXML2.DOMDocument
        Dim strPath As String

        strPath = GetPath(CurrentDb.Name)

        Set docXML = New MSXML2.DOMDocument
        docXML.async = False
        docXML.Load strPath & "encUTF8.xml"

        Set docXSL = New MSXML2.DOMDocument
        docXSL.async = False
        docXSL.Load strPath & "output_UTF8.xsl"

        ' The result is a VB string, so it is always UTF-16
        MsgBox docXML.transformNode(docXSL)
    End Sub

transformNodeToObject does almost the same thing as transformNode, but it redirects the output to an object that is passed as the second parameter. For the topic of this text, I have to say that it's much better, since transformNodeToObject does respect the xsl:output element and its encoding attribute. So you can pass in a UTF-8 file and convert it to UTF-16, and vice versa.

    Sub DOMTransformNodeToObject()
        Dim docXML As MSXML2.DOMDocument
        Dim docXSL As MSXML2.DOMDocument
        Dim docOut As MSXML2.DOMDocument
        Dim strPath As String

        strPath = GetPath(CurrentDb.Name)

        Set docXML = New MSXML2.DOMDocument
        docXML.async = False
        docXML.Load strPath & "encUTF16.xml"

        ' Stylesheet whose xsl:output element requests UTF-8
        Set docXSL = New MSXML2.DOMDocument
        docXSL.async = False
        docXSL.Load strPath & "output_UTF8.xsl"

        Set docOut = New MSXML2.DOMDocument
        Call docXML.transformNodeToObject(docXSL, docOut)
        docOut.Save strPath & "test.xml"
    End Sub

IXSLProcessor
Another way to transform an XML file with an XSLT stylesheet is to use the IXSLProcessor. The transform method is the way to start the transformation, so what we are really interested in here is how the output property works. This property can lead to different results depending on the way you read its content. The MSXML 3 SDK gives a brief note stating that if you read sequentially from the output property, the generated output is always UTF-16 encoded and the xsl:output element is ignored.

    Sub TranfWithIXSLProcessor()
        Dim xtp As MSXML2.XSLTemplate
        Dim xpr As MSXML2.IXSLProcessor
        Dim ftdXSL As MSXML2.FreeThreadedDOMDocument
        Dim docXML As MSXML2.DOMDocument
        Dim strPath As String

        strPath = GetPath(CurrentDb.Name)

        Set docXML = New MSXML2.DOMDocument
        docXML.async = False
        docXML.Load strPath & "encUTF8.xml"

        ' XSLTemplate requires a free-threaded document for the stylesheet
        Set ftdXSL = New MSXML2.FreeThreadedDOMDocument
        ftdXSL.async = False
        ftdXSL.Load strPath & "Output_UTF8.xsl"

        Set xtp = New MSXML2.XSLTemplate
        Set xtp.stylesheet = ftdXSL
        Set xpr = xtp.createProcessor()
        xpr.input = docXML
        xpr.transform

        ' Reading output as a string always gives UTF-16
        MsgBox xpr.output
    End Sub

So the result will be UTF-16 encoded, and an encoding declaration is added too. BUT you can redirect the output to an object, for instance an ADO Stream, and that way the transformation will respect the xsl:output element and its encoding attribute.
    Sub TranfWithIXSLProcessorToStream_v2()
        Dim xtp As MSXML2.XSLTemplate
        Dim xpr As MSXML2.IXSLProcessor
        Dim ftdXSL As MSXML2.FreeThreadedDOMDocument
        Dim docXML As MSXML2.DOMDocument
        Dim stmOut As ADODB.Stream
        Dim strPath As String

        strPath = GetPath(CurrentDb.Name)

        Set docXML = New MSXML2.DOMDocument
        docXML.async = False
        docXML.Load strPath & "encUTF16.xml"

        Set ftdXSL = New MSXML2.FreeThreadedDOMDocument
        ftdXSL.async = False
        ftdXSL.Load strPath & "Output_UTF8.xsl"

        Set xtp = New MSXML2.XSLTemplate
        Set xtp.stylesheet = ftdXSL
        Set xpr = xtp.createProcessor()
        xpr.input = docXML

        ' Send the output to a stream so that xsl:output is respected
        Set stmOut = New ADODB.Stream
        stmOut.Open
        stmOut.Charset = "UTF-8"
        xpr.output = stmOut
        xpr.transform

        stmOut.Position = 0
        MsgBox stmOut.ReadText
    End Sub

MXXMLWriter and SAXXMLReader
MXXMLWriter has a nice variety of properties that let you control what the output is going to look like, but beware that this flexibility can leave you with unusable files. The first thing to understand is the output: by default it is directed to a string, but you can specify any object that supports the IStream interface. When using this from VB you must understand that strings in VB are always UTF-16 encoded, so if you want your file encoded in a different way you will have to go with the IStream interface. If you have the output directed to a string, the byteOrderMark and encoding properties will be ignored. I will be using the ADO Stream to show you these properties.

omitXMLDeclaration - Controls whether the text declaration is output or not.
byteOrderMark - Controls whether the BOM is output or not.
encoding - Defines the output encoding.

    Private Sub SaxReaderWriter()
        Dim rdr As New MSXML2.SAXXMLReader
        Dim wrt As New MSXML2.MXXMLWriter
        Dim stm As ADODB.Stream
        Dim strPath As String

        strPath = GetPath(CurrentDb.Name)

        ' The writer receives the SAX events produced by the reader
        Set rdr.contentHandler = wrt
        wrt.omitXMLDeclaration = False
        wrt.byteOrderMark = True
        wrt.encoding = "UTF-8"

        Set stm = New ADODB.Stream
        stm.Open
        ' You need to set the stream Charset, explained below
        stm.Charset = "UTF-8"
        wrt.output = stm

        rdr.parseURL strPath & "encUTF16.xml"
        wrt.Flush
        stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
    End Sub

On the SAXXMLReader you can read these properties with getProperty and set them with putProperty.

ADO 2.6

Recordset
ADO uses UTF-8 when saving a file using adPersistXML.

    Sub PersistADOtoXML()
        Dim cnn As ADODB.Connection
        Dim rst As ADODB.Recordset
        Dim strPath As String

        strPath = GetPath(CurrentDb.Name)

        Set cnn = New ADODB.Connection
        cnn.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
                 "Data Source=" & CurrentDb.Name & ";"

        Set rst = New ADODB.Recordset
        rst.Open "SELECT * FROM tblTest", cnn

        ' Save fails if the file already exists, so remove it first
        On Error Resume Next
        Kill strPath & "test.xml"
        rst.Save strPath & "test.xml", adPersistXML
        rst.Close
        Set rst = Nothing
    End Sub

Stream
The LoadFromFile method loads the content of a file into the stream. You can load any encoding you want, but when you open a Stream object its default encoding is UTF-16, so if you wish to load a file in another encoding you must change the Charset of the stream.

    Sub ReadSaveFileInUTF8()
        Dim stm As ADODB.Stream
        Dim strPath As String

        strPath = GetPath(CurrentDb.Name)

        Set stm = New ADODB.Stream
        stm.Open
        ' Tell the stream that the incoming text is UTF-8
        stm.Charset = "UTF-8"
        stm.LoadFromFile strPath & "encUTF8.xml"
        stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
    End Sub

The default value of the Charset property of a stream is UTF-16, and you can set this property to tell the stream which encoding the text it receives is in. I used it in the previous sample to indicate that I was loading a UTF-8 encoded file. The documentation also states that this property can translate from one encoding form to another. In some samples that I've made, this caused serious trouble and left me with a useless file. For instance, let's say that you want to convert a UTF-16 file to UTF-8; you could use some code like this:

    Sub TranlateToUTF8()
        ' ERROR: don't use this code, it is only here to expose the error
        Dim stm As ADODB.Stream
        Dim strPath As String

        strPath = GetPath(CurrentDb.Name)

        Set stm = New ADODB.Stream
        stm.Open
        stm.LoadFromFile strPath & "encUTF16.xml"
        stm.Charset = "UTF-8"
        stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
    End Sub

This will really damage your file: you now have two BOMs, one for UTF-8 (0xEF BB BF) and the one that was already there for UTF-16 (0xFF FE), and the content is still two bytes per character (UTF-16 encoded).

According to the KB article at http://support.microsoft.com/support/kb/articles/Q259/5/55.ASP, mixing the Recordset object and the Stream object leads to a problem. The explanation in that KB article is understandable, but with the version of ADO that I have (2.60.6526.0) I couldn't reproduce the problem described there.
    Sub PersistStreamToXML()
        Dim cnn As ADODB.Connection
        Dim rst As ADODB.Recordset
        Dim stm As ADODB.Stream
        Dim strPath As String

        strPath = GetPath(CurrentDb.Name)

        Set cnn = New ADODB.Connection
        cnn.Open "Provider=Microsoft.Jet.OLEDB.4.0;" & _
                 "Data Source=" & CurrentDb.Name & ";"

        Set rst = New ADODB.Recordset
        rst.Open "SELECT * FROM tblTest", cnn

        ' Persist the recordset into the stream instead of a file
        Set stm = New ADODB.Stream
        rst.Save stm, adPersistXML
        rst.Close
        Set rst = Nothing

        MsgBox stm.Charset
        stm.SaveToFile strPath & "test.xml", adSaveCreateOverWrite
        Set stm = Nothing
    End Sub

This outputs a perfect UTF-16 encoded file…

Final Comment

I could go on forever with this part of the article: there are so many objects and interfaces that it is impossible to cover them all in just one article. I've tried to talk about the more obvious ones, or at least the ones I'm using. I also made a deliberate choice to describe only the MSXML parser. Others will behave differently, so you are advised to read their documentation and hope not to be bitten by encoding problems. I'm aware that this was just a few words on the subject, as the title states.

RESOURCES

Of all of them I would like to point out Michael Kaplan's book. If you don't have it, IMHO it's a very helpful resource on the i18n subject.
http://www.amazon.com/exec/obidos/ASIN/0672319772/002-9388292-4866449

A good article, "XML International", in XML Developer (October 2000), by Michael Kaplan.
http://www.xmldevelopernewsletter.com

I've also found the article by Chris Lovett on MSDN a good resource.
http://msdn.microsoft.com/library/en-us/dnxml/html/xmlencodings.asp

MSXML 3 SDK, download it at
http://download.microsoft.com/download/xml/SDK/3.0/WIN98Me/EN-US/xmlsdk.exe

KB Article
http://support.microsoft.com/support/kb/articles/Q275/8/83.ASP

Fundamental for Specifications
Please send any comments or problems by e-mail to pmpg98@hotmail.com