What is XML?


What is XML? 
XML stands for Extensible Markup Language. Spearheaded by the World Wide Web Consortium (W3C), XML 1.0 became a formal W3C Recommendation in February 1998. 
XML developers will tell you that XML isn't a language but rather a system for defining other languages. You may have already heard of, or even used, one of these other languages--Microsoft's Channel Definition Format (CDF) for push, for example. 
The W3C, which is working on a slew of XML-related recommendations, calls XML "a common syntax for expressing structure in data." Structured data refers to data that is tagged for its content, meaning, or use. For example, whereas the <H1> tag in HTML specifies text to be presented in a certain typeface and weight, an XML tag would explicitly identify the kind of information: <BYLINE> tags might identify the author of a document, <PRICE> tags could contain an item's cost in an inventory list--all the way down to <DOGFOODBRAND> if that's the level of detail required. 
By separating structure and content from presentation, the same XML source document can be written once, then displayed in a variety of ways: on a computer monitor, within a cellular-phone display, translated into voice on a device for the blind, and so forth. It'll work on any communications devices that might be developed; an XML document can thus outlive the particular authoring and display technologies available when it was written. 
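The idea of writing a document once and presenting it several ways can be sketched in a few lines of Python. This is only an illustration: the ARTICLE, TITLE, and BYLINE tags and the two rendering functions are invented for the example, and a real system would more likely use style sheets than hand-written functions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document; the element names are invented for illustration.
SOURCE = """<ARTICLE>
  <TITLE>What is XML?</TITLE>
  <BYLINE>A. Writer</BYLINE>
</ARTICLE>"""

def render_text(article):
    """Plain-text view, for example for a small display."""
    title = article.findtext("TITLE")
    author = article.findtext("BYLINE")
    return f"{title.upper()} (by {author})"

def render_html(article):
    """HTML view of the same data, for a desktop browser."""
    title = escape(article.findtext("TITLE"))
    author = escape(article.findtext("BYLINE"))
    return f"<h1>{title}</h1><p><em>{author}</em></p>"

article = ET.fromstring(SOURCE)
print(render_text(article))
print(render_html(article))
```

Both views come from the same parsed source, so adding a third output device means adding a function, not rewriting the document.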
So XML will have a life outside of the Internet, serving the publishing industry at large, for example, and especially people who produce documents intended to appear across multiple media. Some large-scale document publishers who have been using Standard Generalized Markup Language (SGML) for years will convert to XML. Still, platform-independent XML was developed for the Web, and that's where it will have the most impact. 
 
The DOM
XML's real strength for the Web is how it interacts with the Document Object Model (DOM), an interface that defines the mechanisms for accessing data in a document.
 
Using the DOM, programmers can script dynamic content in a standardized way. In other words, they can use it to cause a specific piece of content in a browser's document tree to behave in a certain way, creating a small effect--for example, a piece of text might turn blue when a user mouses over it. Both Netscape Navigator and Microsoft Internet Explorer have their own proprietary DOMs, but both companies say they will support the W3C standard DOM in the next versions of their browsers. 
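A rough feel for DOM-style scripting is available outside the browser too. The sketch below uses Python's standard xml.dom.minidom module, which implements the W3C DOM interfaces; the PAGE and HEADLINE elements and the color attribute are made up for the example, and a browser script would react to a mouse event rather than run top to bottom.

```python
from xml.dom.minidom import parseString

# Hypothetical document; element and attribute names are invented.
doc = parseString('<PAGE><HEADLINE color="black">Hello</HEADLINE></PAGE>')

# Locate a node in the document tree by its element name.
headline = doc.getElementsByTagName("HEADLINE")[0]

# Read and change the node through the standard DOM interface.
original_color = headline.getAttribute("color")
headline.setAttribute("color", "blue")
headline.firstChild.data = "Hello, DOM"

print(doc.documentElement.toxml())
```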
 
Why is XML important? 
The saying among Web heads is that content is king. Unfortunately, too often that content is intimately tied to how it's displayed. How many times have you come across a Web site with a little disclaimer saying "Best viewed at 800-by-600-pixel resolution"? 
XML will help solve that problem because, rather than specifying where to display something, Web builders will be able to specify the structure of the document. For example, you can specify the document's title, its author, a list of related links, and so on. Then any device with an XML browser--a palm-top computer, a set-top box, or a high-powered workstation, for example--will be able to render a version of the document specifically tailored to that device. 
Perhaps XML's best feature, though, is its inherent extensibility. Companies and organizations will be able to extend XML to meet new challenges and applications. One XML-based language is already in use--Microsoft's Channel Definition Format (CDF)--and more are on the way, including the Resource Description Framework (RDF) and the Open Software Description (OSD). 
XML also holds the promise of becoming a standardized mechanism for the exchange of data as well as documents. For example, XML may become a way for databases from different vendors to exchange information across the Internet. 
It's still too early to determine precisely where XML is heading. But the possibilities are awesome, a big reason why there's so much excitement surrounding XML. 
 
How are SGML, HTML, and XML related? 
Standard Generalized Markup Language (SGML) is a way of expressing data in text-processing applications. It's been around for more than a decade; both XML and HTML are document formats derived from SGML. Thus they all share certain characteristics, such as a similar syntax and the use of bracketed tags. But HTML is an application of SGML, whereas XML is a subset of SGML. 
The distinction is important. Basically, HTML can't be used to define new applications, but XML can. For example, both the Resource Description Framework (RDF) and the Channel Definition Format (CDF) are applications that were defined using XML. XML and HTML are really more like cousins than siblings. (The W3C has developed a great diagram to help clarify this relationship.) 
XML is actually compatible with SGML: XML documents can be read by any SGML authoring or viewing tool. However, XML is less complex than SGML, and it's designed to work across a limited-bandwidth network such as the Internet. According to Tim Bray, coeditor of the XML specification, the idea behind XML was to take the benefits of SGML, remove the complicated parts, keep it light, and make it work on the Web. 
HTML, SGML, and XML will continue to be used where appropriate; none of them will render the others obsolete. HTML will remain the simplest way to publish data quickly on the Web, mostly short-term data such as meeting agendas or advertising brochures. If the data has a longer-term use and needs a bit more structure, Web builders will want to move to XML. Unlike HTML and XML, SGML will probably never gain widespread acceptance on the Internet, simply because it was never designed or optimized for the demands of a network protocol. For high-end, highly structured publishing applications, SGML will continue to fit the bill. 
 
How will XML be implemented? 
XML will be used in a couple of different ways. One is for data interchange between humans and machines, such as from a Web server to a user's browser. The other is for data exchange between applications, or from machine to machine. 
In either case, you'll likely require a three-tiered architecture: a database back end; a middle-tier server, where the business logic acts on the data; and the client, where the data is displayed and processed further. The database can receive information, perhaps already XML-formatted, from multiple data sources. The middle tier can then pull together the data and publish it to the final-presentation tier. 
Today, Web pages are sometimes delivered this way; for example, News.com publishes from a database. But to get a new view of a page, such as News.com's new printer-friendly option, the server has to generate a new page. A properly formatted XML document will allow the client application to modify the appearance of the document for different media, such as a printer. 
 
What is a Document Type Definition (DTD)? 
A Document Type Definition (DTD) is a set of syntax rules for tags. It tells you what tags you can use in a document, what order they should appear in, which tags can appear inside other ones, which tags have attributes, and so on. Originally developed for use with SGML, a DTD can be part of an XML document, but it's usually a separate document or series of documents. 
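To make this concrete, here is what a small DTD might look like when it is carried inside the document itself. The INVENTORY vocabulary is hypothetical. The Python wrapper uses the standard xml.dom.minidom module, which reads the DOCTYPE declaration but does not validate against it, so treat this as a look at the syntax and not as a validation example.

```python
from xml.dom.minidom import parseString

# Hypothetical inventory vocabulary, with the DTD in the internal subset.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE INVENTORY [
  <!ELEMENT INVENTORY (ITEM+)>
  <!ELEMENT ITEM (NAME, PRICE)>
  <!ATTLIST ITEM sku CDATA #REQUIRED>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT PRICE (#PCDATA)>
]>
<INVENTORY>
  <ITEM sku="A100"><NAME>Dog food</NAME><PRICE>12.50</PRICE></ITEM>
</INVENTORY>"""

doc = parseString(DOCUMENT)
print(doc.doctype.name)  # the document type named in the DOCTYPE declaration
items = doc.getElementsByTagName("ITEM")
print(items[0].getAttribute("sku"))
```

The ELEMENT lines state which tags may appear and in what order; the ATTLIST line states that every ITEM must carry an sku attribute.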
Because XML is not a language itself, but rather a system for defining languages, it doesn't have a universal DTD the way HTML does. Instead, each industry or organization that wants to use XML for data exchange can define its own DTDs. 
If an organization uses XML to tag documents for internal use only, it can create its own private DTD. The Wall Street Journal Interactive Edition, for example, has a DTD specifying each edition, with information about pages, articles, summaries, bylines, and so forth. The Journal currently uses an SGML DTD (called the Dow Jones Markup Language), but it is developing an XML version as well. 
DTDs are not free from controversy. While some people feel they add substantial value in business, others feel they constrain creativity. Still others think they're useful but don't go far enough. Microsoft is attempting to address this last complaint with its XML-Data proposal, but critics say these improvements should be made within the DTD specification itself. 
 
Microsoft's schema
A group of vendors including Microsoft has proposed an alternative approach to the DTD called a schema, which they have submitted to the W3C as XML-Data. Like a DTD, a schema provides the rules of a document and indicates what tags are used, what their attributes are, the relationships between the tags, and so on.
 
Unlike DTDs, however, a schema can define data types. For example, a DTD might have a tag designated as <PRICE>, but the content contained within that tag could be a number or a character string. A schema could force you to enter a number. 
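The kind of check a typed schema adds can be approximated by hand. The sketch below is not XML-Data syntax; it is a small Python function, written for this example, that reports every PRICE element whose content cannot be parsed as a decimal value.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation

def price_errors(xml_text):
    """Return the text of every PRICE element that Decimal cannot parse."""
    bad = []
    for price in ET.fromstring(xml_text).iter("PRICE"):
        try:
            Decimal((price.text or "").strip())
        except InvalidOperation:
            bad.append(price.text)
    return bad

print(price_errors("<LIST><PRICE>12.50</PRICE><PRICE>cheap</PRICE></LIST>"))
```

A DTD alone would accept both PRICE elements, since each contains character data; a schema with data types could reject the second one before the application ever sees it.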
This approach clearly has benefits, especially for data exchange among applications, objects, or databases. The only question is whether this approach will somehow be rolled into the DTD specification or end up as a separate extension to XML. 
 
What are well-formed and valid documents? 
There are essentially two related types of XML documents: well-formed and valid. A well-formed XML document conforms to the general rules of XML syntax, which are more rigorous than those of either HTML or SGML. XML character data is never left hanging without an ending markup designation of some sort, either an end tag, as in the tag pair <MYTAG></MYTAG>, or a special empty element tag with a forward slash before the right angle bracket, such as <MYTAG/>. XML markup always starts with a left angle bracket or an ampersand; element types and attribute names are case-sensitive; attributes require quotation marks; and so on. 
Valid XML documents are documents that conform to a specific Document Type Definition (DTD). Confirming the validity of XML documents is largely the work of authoring and publishing tools, whereas XML-capable browsers need only check for well-formedness in order to read XML documents. Thus the XML parser in an authoring tool will have to worry about checking for well-formedness and validity, but browsers will have to worry only about looking for well-formed XML. 
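The well-formedness rules are easy to try out. This sketch uses the non-validating parser in Python's standard library; the is_well_formed helper and the sample strings are written for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "closed pair": "<MYTAG></MYTAG>",
    "empty element": "<MYTAG/>",
    "missing end tag": "<MYTAG>",
    "case mismatch": "<MYTAG></mytag>",
    "unquoted attribute": "<MYTAG size=10/>",
}
for label, text in samples.items():
    print(label, is_well_formed(text))
```

The first two samples pass and the last three fail. Checking validity would take a further step, comparing the document against a DTD, which this parser does not do.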
 
How do browsers read XML? 
A tool for reading XML documents is popularly called an XML parser, though the more formal name is an XML processor. XML processors pass data to an application for authoring, publishing, searching, or displaying. XML doesn't provide an application programming interface (API) to an application; it just passes data to it. No XML processor will parse data that isn't well-formed. Both Netscape and Microsoft either already include or are planning to include XML parsers in their browsers. 
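One way to picture a processor handing data to an application is an event-driven parser. The sketch below uses the expat parser bundled with Python; the three handler functions and the events list stand in for "the application" and are invented for the example.

```python
import xml.parsers.expat

events = []  # what the application receives from the processor

def start(name, attrs):
    events.append(("start", name, dict(attrs)))

def end(name):
    events.append(("end", name))

def text(data):
    if data.strip():  # ignore whitespace-only character data
        events.append(("text", data))

parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = True  # deliver contiguous character data in one call
parser.StartElementHandler = start
parser.EndElementHandler = end
parser.CharacterDataHandler = text
parser.Parse('<ITEM sku="A100"><NAME>Dog food</NAME></ITEM>', True)

for event in events:
    print(event)
```

If the input were not well-formed, Parse would raise an error instead of delivering the remaining events.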
The XML developer community makes free XML readers and parsers available for use in applications or XML authoring software: 
Textuality's Lark, from one of the coeditors of the XML specification.
Microstar Software's Ælfred, a Java-based parser.
DataChannel's DXP, formerly the well-known NXP (Norbert Mikula's XML Parser), to which APIs have been added.
 
How does RDF relate to XML? 
If XML is the ability to speak a language, XML applications are specific languages. Resource Description Framework (RDF) is one such XML application: a data-modeling language using XML syntax. 
RDF is a way of describing and accessing data. That means RDF is data about data, or metadata. In the case of the Web, this metadata will be applied to creating standardized site maps, more precise search results, and hierarchical topic indices. 
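As a sketch of what such metadata can look like, the example below describes one Web page with a title and a creator. It uses the RDF/XML syntax and namespace as the W3C later standardized them, which may differ from the drafts described here; the page URL and the values are placeholders.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Placeholder description of a single page.
RDF_DOC = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/index.html">
    <dc:title>Example home page</dc:title>
    <dc:creator>A. Writer</dc:creator>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(RDF_DOC)
metadata = {}
for desc in root.findall("rdf:Description", NS):
    resource = desc.get("{%s}about" % NS["rdf"])
    metadata[resource] = {
        "title": desc.findtext("dc:title", namespaces=NS),
        "creator": desc.findtext("dc:creator", namespaces=NS),
    }
print(metadata)
```

A search engine or site-map tool could collect records like these without having to guess at the meaning of the page's visible text.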
RDF also allows for intelligent bookmarks that change as the Web pages being referenced change. This is very useful if you're tracking a site whose content is regularly updated. 
It won't be difficult for Web builders to create metadata regarding their Web site content that can be referenced by search engines. We'll soon have access to commercially available software that automatically produces an RDF file of a given site. 
XML metadata will also energize the market for companies whose business is to describe and rate information. There are many ratings bureaus springing up on the Web, bureaus that rate everything from kid-safe sites to the best movie or wine sites. RDF describes the syntax the ratings bureaus can use. People will choose the ratings bureau whose vocabulary they're most comfortable with--where vocabulary refers to the particular set of terms the bureau uses to rate different types of content--from sex and violence to wine acidity. 
 
How does Netscape implement XML in its browser? 
Netscape has supported XML metadata since Communicator/Navigator 5.0 through a delivery component called Aurora. Aurora takes advantage of RDF to achieve what Netscape calls "full information integration on the desktop." 
Aurora finds and manages information across networks, desktops, and databases. It will appear on the desktop as a "windowpane" menu interface that pulls together pointers to resources relating to current projects, research topics, or regular activities. RDF lets the Aurora navigation bar point to local files of varying data types (word processing documents, spreadsheet data, email messages, database content), as well as to resources on Internet or intranet servers (search and query results, bookmark links, and so on).  
An XML parser that reads RDF is already part of Netscape's browser. Beyond this initial RDF implementation, Netscape is planning to include a generalized XML parser in its browser that would work with other XML applications such as the Shakespeare markup (an early XML application), Chemical Markup Language (CML), and MathML, a mathematical markup language that is in the process of becoming a W3C Recommendation. 
"We want to turn Navigator into an XML platform," says R.V. Guha, Netscape principal engineer. Guha originally developed MCF (MetaContent Format), which has since been folded into the RDF specification.  
 
How does Microsoft implement XML in its browser? 
Microsoft's Internet Explorer 4.0 was the first Web browser to implement XML. Microsoft offers a pair of XML processors: a parser written in C++ that comes with the browser, and source code to a Java parser that Web builders can download and incorporate into their applications.  
The Java parser is a validating parser, meaning it checks against a Document Type Definition (DTD) or schema. To improve performance, the C++ version that comes with the browser is a nonvalidating parser. 
According to Steve Sklepowich, Microsoft's XML product manager, both parsers are "generalized" in the sense that they aren't dependent on specific XML applications such as the Channel Definition Format. Since XML data is separate from its presentation, the ability to actually display XML natively in a Web browser requires a style sheet, such as XSL (Extensible Stylesheet Language). 
In the meantime, Microsoft uses what it calls the XML Data Source Object, or XML DSO. This model uses the data-binding capability of Dynamic HTML to link XML data on the one hand to HTML on the other. 
Microsoft also uses the XML Object Model (XML OM) to let developers interact with XML data in the browser. It does this through a method of exposing HTML as objects based on the Document Object Model (DOM), though HTML and the DOM aren't directly compatible. The DOM lets scripts and programs access structured XML data. 
 
How do OSD and CDF relate to XML? 
Channel Definition Format (CDF) and Open Software Description (OSD) are XML applications championed by Microsoft. With its XML parser, Microsoft's Internet Explorer reads CDF files to drive and control collections of pages that come together in push channels. In light of work done with the Resource Description Framework (RDF), the CDF proposal was recently resubmitted to the W3C to take advantage of RDF's ability to show relationships between various data elements. 
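A channel file is itself a small XML document. The sketch below shows roughly what one might contain and how a program could list the pages in it. The CHANNEL, ITEM, TITLE, and HREF names follow CDF as it is commonly described, but check them against the specification before relying on them; the URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Illustrative channel file; URLs are placeholders.
CDF_DOC = """<?xml version="1.0"?>
<CHANNEL HREF="http://example.com/news/index.html">
  <TITLE>Example News</TITLE>
  <ITEM HREF="http://example.com/news/story1.html"><TITLE>Story one</TITLE></ITEM>
  <ITEM HREF="http://example.com/news/story2.html"><TITLE>Story two</TITLE></ITEM>
</CHANNEL>"""

channel = ET.fromstring(CDF_DOC)

# Collect the title and address of every page in the channel.
pages = [(item.findtext("TITLE"), item.get("HREF"))
         for item in channel.findall("ITEM")]

print(channel.findtext("TITLE"))
for title, href in pages:
    print(title, href)
```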
Open Software Description is the vocabulary used to describe software components, tagging with syntax such as dependency, version, and platform. OSD describes how to advertise a component's properties and how to install that component onto a computer. OSD could be used to download a complete software package, but it's primarily designed for incremental updating. OSD works alone or with CDF to define application channels.  
OSD was submitted to the W3C in August of 1997 by a group of vendors led by Microsoft and Marimba. 
 
What about e-commerce and XML? 
For four years, CommerceNet, the 500-member nonprofit Web-commerce consortium, has been trying to help e-commerce products and systems work together. The idea is to allow information to be exchanged from catalog to catalog, from catalog to payment system, and from payment system to payment system. It turns out that XML can help achieve this goal in two important ways: content definition and information exchange. 
Content definition: CommerceNet is working to define data elements common to a variety of commerce transactions. This so-called Commerce Core would define how to tag things like company name and address, price, item, and quantity. 
Information exchange: Open, text-based XML is ideal for exchanging transaction information from one server to another. CommerceNet proposes using the XML-based Common Business Language (CBL) to describe product- and service-catalog software, metadata about business rules and systems, and software for forms and messages. Much of the CBL is drawn from existing Electronic Data Interchange (EDI) dictionaries that identify agreed-upon terms such as invoices and purchase orders. But CBL goes beyond EDI's business-to-business focus to include retail transactions and the horizontal supply chain--from manufacturer to wholesaler to retailer. 
One such CBL application is the Product Information Exchange (PIX) specification for catalog interoperability. CommerceNet designed PIX to help manufacturers and their distributors exchange product data more easily. 
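To see why common tags matter for exchanging transaction data, consider the sketch below. The ORDER vocabulary is invented for the example and is not the Commerce Core, CBL, or PIX; the point is only that once both sides agree on tags for company, item, price, and quantity, the receiving server can process an order without custom parsing.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical purchase order; the tag names are not from any real standard.
ORDER = """<ORDER>
  <COMPANY><NAME>Example Pet Supply</NAME><ADDRESS>1 Main St.</ADDRESS></COMPANY>
  <LINE><ITEM>Dog food</ITEM><PRICE>12.50</PRICE><QUANTITY>4</QUANTITY></LINE>
  <LINE><ITEM>Leash</ITEM><PRICE>8.00</PRICE><QUANTITY>1</QUANTITY></LINE>
</ORDER>"""

def order_total(xml_text):
    """Sum price times quantity over every LINE in the order."""
    order = ET.fromstring(xml_text)
    return sum(
        (Decimal(line.findtext("PRICE")) * int(line.findtext("QUANTITY"))
         for line in order.findall("LINE")),
        Decimal("0"),
    )

print(order_total(ORDER))
```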
The long-term goal is for industry groups--not CommerceNet--to use CBL as a common basis for specific DTDs. Several industry-focused initiatives have already been announced: 
Open Buying on the Internet (OBI): A standard for international business-to-business purchasing of goods through the Internet. OBI is based on current Internet standards such as HTML, SSL (for security), SET (for credit-card transactions), and X.509 (for digital certificates). Among OBI's supporters are Commerce One, Connect, Intelisys, InterWorld, Microsoft, Netscape, Open Market, and Oracle. 
Open Trading Protocol (OTP): A consistent, interoperable environment for selling to consumers on the Web. Rules will range from how to offer items for sale to payment choices to product delivery, receipts, and problem resolution. OTP is backed by MasterCard International, DigiCash, CyberCash, Hewlett-Packard, IBM, AT&T Universal Card, Netscape, the Royal Bank of Canada, and a number of other financial institutions and technology companies. 
Internet Content and Exchange (ICE): Vignette and a number of other companies--including Microsoft--are developing a specification called ICE to enable the site-to-site exchange of online assets, whether those are content, applications, or metadata. ICE will leverage existing standards, including OPS/P3P (for trusted exchange of personal information), CDF, OSD, XML-Data, and RDF. (Note: CNET has a financial interest in Vignette.) 
 
What about style sheets in XML? 
Because XML separates content from presentation, Web builders need a new way to control design, display, and output issues. Style sheets are the answer. Currently, there are three types of style sheets that are candidates for use with XML: 
Cascading Style Sheets (CSS)
Extensible Stylesheet Language (XSL)
Document Style Semantics and Specification Language (DSSSL)
XML's support for the existing CSS standard will handle most basic style and page-layout issues when it's supported in the 5.0 browsers. But CSS may not be robust enough to suit professional publishers. So, at the high end of the spectrum lies DSSSL (pronounced like "thistle"), a standard from the International Organization for Standardization (ISO) that's popular with high-end publishers using SGML. But DSSSL is complex and handles print-document management issues that are of little use on the Web.
That leaves Extensible Stylesheet Language (XSL), the style-sheet language written in XML expressly for XML. XSL, currently submitted as a W3C proposal, gives Web developers and users more presentation flexibility than HTML does. For example, HTML's <H2> tag is rendered in essentially the same way on all browsers. But XSL lets developers specify how their page elements should be rendered (though users can override those specifications in their personal settings). 
 
XSL is more powerful than CSS because XSL lets Web builders create documents that can alter their own appearance dynamically. You could, for example, include a programming-language statement such that "if an attribute of an XML element has a value of 10, render it in green, otherwise render it in black." Or you might mark a paragraph with "for internal use only" as an attribute, so it wouldn't show up at all in certain instances. XSL is designed to work with scripting languages such as JavaScript. 
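The two rules just described can be sketched in Python to show the logic involved. This is not XSL syntax; the REPORT document, the level and audience attributes, and the style function are all invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; attribute names are invented for illustration.
DOC = """<REPORT>
  <PARA level="10">Quarterly results are up.</PARA>
  <PARA level="3">Details follow.</PARA>
  <PARA audience="internal">Draft figures, do not distribute.</PARA>
</REPORT>"""

def style(xml_text, include_internal=False):
    """Return (color, text) pairs for the paragraphs that should be shown."""
    shown = []
    for para in ET.fromstring(xml_text).iter("PARA"):
        if para.get("audience") == "internal" and not include_internal:
            continue  # suppress internal-only paragraphs
        color = "green" if para.get("level") == "10" else "black"
        shown.append((color, para.text))
    return shown

for color, text in style(DOC):
    print(color, text)
```

In a real style sheet these conditions would live alongside the formatting rules, so the document itself stays free of presentation logic.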
 
How does XML improve hyperlinking? 
XML hyperlinking goes beyond basic HTML-style hyperlinking with a number of new features, including the ability to create "smart" links without a lot of hand-coded JavaScript. And in XML, links become objects in their own right and can thus be managed like any other objects. 
The original linking specification--XLL, or XML Linking Language--is split into two separate specs: XPointer and XLink. 
XPointer: In HTML it's possible to link to the middle of a page only if the author of that page put an anchor tag there. With XPointer you'll be able to "address to" (not "link to") any part of someone else's text. It's easy to see how this ability would be useful in working with legal documents, scientific and academic papers, even W3C specifications!  
XLink: When a user clicks an HTML hyperlink, the current Web page is replaced by the file being linked to. XLink lets Web builders add behaviors to links. Today, for example, you have to use a bit of JavaScript to make a link pop up a separate window, but XLink lets Web builders code links to perform a variety of actions, including popping up a menu of linking choices. 
Another application of this technique might be to pop up a dialog box with a message. 
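For a sense of how link behavior can be declared in markup and not scripted, here is a sketch using the XLink attributes as the W3C later standardized them, which may differ from the drafts described here. The CITATION element and the target URL are placeholders; the show value "new" asks for the target to open in a new window.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Placeholder element carrying a simple link with declared behavior.
DOC = """<CITATION xmlns:xlink="http://www.w3.org/1999/xlink"
    xlink:type="simple"
    xlink:href="http://example.org/spec.xml"
    xlink:show="new"
    xlink:actuate="onRequest">the specification</CITATION>"""

link = ET.fromstring(DOC)

# Namespaced attributes are addressed as {namespace-uri}local-name.
href = link.get(f"{{{XLINK}}}href")
show = link.get(f"{{{XLINK}}}show")

print(href, show)
```

Because the link is ordinary markup, any element can act as a link, and a program can list or rewrite links the same way it handles other data.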
XML also lets Web builders create Extended Links that work sort of like a Web ring, which is a self-selected group of Web sites relating to the same topic that are navigated through a "next/previous" progression. For lists of related links too long for a pop-up menu, Web builders could create a linked list that changes from site to site and from page to page. Users could click an icon to automatically move to the next member of the ring. Today this capability would require CGI scripts, but Extended Links offers a standardized, nonproprietary method of creating relationships among resources.  
There are additional issues still left to work out, especially in the area of behavior policies. There has to be a way to negotiate between the behavior a document's author recommends for a link, a user's preferences in regard to displaying link information, and policies as to if and when the user's desires should be overridden. 
 
Who should learn XML? 
All Web builders need to know enough about XML to decide whether or not they should use it. E-commerce sites and sites that manage large numbers of documents stored in databases are obvious initial candidates. 
HTML is still more than adequate for marking up information if the ultimate goal is simply for it to be read by a human being. But if you want to prepare for automatic processing of data, you should think about incorporating XML into your publishing systems. 
Not every HTML producer working on every Web site has to become an XML producer, but someone on the staff of every company should become proficient--especially if the site works with data and documents worth managing for future use. 
Of course, XML's power also means complexity: some Web builders have found that while they can grasp the basics of HTML in a few days, they may have to spend a few weeks becoming comfortable with XML. Only you can decide if it's worth the effort. 
 
What XML authoring tools can I use? 
Fortunately, Web builders won't be left on their own to create XML from scratch. Tools for creating, managing, and delivering XML are already on the market or in development by a number of companies. 
Adobe: In mid-1998, Adobe introduced versions of FrameMaker and FrameMaker+SGML that can export to XML. The next full release of these products will be able to import XML. Adobe has a representative on the W3C's XML working group and is also involved with XLink, Cascading Style Sheets, and RDF, so it makes sense to expect these technologies to appear in future Adobe products. 
Allaire: HomeSite and Cold Fusion have supported XML, including style sheets, since version 4.0. A CDF add-on is already available for HomeSite.  
DataChannel: A free, Java-based validating parser called DXP (DataChannel XML Parser; based on Norbert Mikula's well-known NXP) is available from the company's Web site. Also released is the free XML toolkit, XML Development Environment, which includes a set of components for people who want to get started with XML. 
Inso: This company offers what it calls "the first integrated, end-to-end publishing solution for creating, converting, storing, managing, indexing, searching, and publishing XML content to the Web, CD-ROM, and print." Products include DynaTag, DynaBase, DynaText, and companion tool DynaWeb.  
IntraNet Solutions: The next version of Intra.doc Management System, IntraNet Solutions' Web-based document management system, will manage relationships between XML components and the document, provide integrated link management with third-party XML authoring tools, refine the use of XML objects in browsers, and offer an interactive metadata model between the Intra.doc repository and XML editors.  
Microsoft: Microsoft Office has XML support. 
Microstar: ActiveSG/XML is a set of tools and techniques for design and deployment of XML/SGML transaction-based systems on the Internet. Microstar also offers the free Ælfred XML parser. 
SoftQuad (+Borland): HTML editor HotMetal Pro will soon offer Live Data Base Pages, an add-on that lets developers drag and drop HTML data into a database and have it returned as XML. 
Vignette: StoryServer 3.2 delivers XML-enabled applications and content on the Web. It combines tools for relational database, multimedia, and XML content creation. StoryServer is a Web-content application platform for building, managing, and delivering service-based Web applications, such as online publishing, knowledge management, and e-commerce systems. 
XPublish: XPublish is an XML publishing system for Web site development and management that permits a developer to author in XML or extend current HTML documents with XML constructs, then publish the site as HTML for access by any standard Web browser. A Cascading Style Sheets editor is included. 
WebMethods: The company makes XML-based Web Automation software, providing rapid integration with, and direct access to, Web data from within business applications. Its Web Interface Definition Language (WIDL) automates all interactions with HTML/XML documents and forms, providing a general method of representing request-response interactions over standard Web protocols. 
Of course, if XML becomes ubiquitous on the Web, you can expect nearly every type of Web-based application--especially HTML editors, database software, and e-commerce software--to quickly incorporate some level of XML support. 
 
What about internationalization? 
XML makes it easier than ever before for Web builders to create truly international sites because, like Java, it's defined in Unicode (ISO 10646), an internationally accepted standard for depicting virtually all of the world's letters, glyphs, characters, and ideograms. Unicode includes the ASCII and ISO Latin characters, as well as Japanese, Korean, Chinese, Hindi, Greek, and Arabic, among others. It even permits the mixing of character sets--for example, an XML document displayed in Japanese kanji could reference a German word.  
Developers don't have to learn any special script for Unicode to be in effect in XML documents, which will be displayed in users' browsers using the appropriate character set. 
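A short sketch shows mixed scripts surviving a round trip through an XML parser. The NOTE and TERM elements are invented for the example; the xml:lang attribute is the standard XML way to label the language of an element's content.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"

# Japanese text that references a German word, stored as UTF-8 bytes.
text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<NOTE xml:lang="ja">日本語の文書 '
    '<TERM xml:lang="de">Gemütlichkeit</TERM></NOTE>'
)
raw = text.encode("utf-8")

note = ET.fromstring(raw)
term = note.find("TERM")

print(note.text, term.text, term.get(f"{{{XML_NS}}}lang"))
```

No escaping or special handling is needed for either script; the encoding declaration tells the parser how to read the bytes.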
 
What's the future of XML? 
With all the activity surrounding XML, it's difficult to predict where it will be in six months. Tim Bray, coauthor of the XML and XLL specifications, says, "We have produced a tool that's designed to be general purpose, and the broad range of people leaping on board is evidence that we've succeeded."  
In the short term, XML will probably surface first in metadata applications such as RDF. The next big impact will come with the approval of the Document Object Model specification. Bray claims that "the combination of XML and the DOM is really the magic bullet that will bring the Web alive." 
XML should also help jump-start electronic commerce. XML will let e-commerce vendors tag products and the information associated with them (price, size, color, features) in a common way, making it easy for customers to comparison shop across the Web.  
Meanwhile, Netscape and Microsoft can be counted on to continue expanding XML browser support to include both valid and well-formed XML documents, more XML applications, style sheet support designed for XML, and XML hyperlinking protocols. Watch both companies--as well as third-party software vendors--for XML authoring and publishing tool developments.